Distributed Consensus Algorithms
Abstract

We study the effect of communication delays on distributed consensus algorithms. Two ways to 
model delays on a network are presented. The first model assumes that each link delivers messages with a 
fixed (constant) amount of delay, and the second model is more realistic, allowing for i.i.d. time-varying 
bounded delays. In contrast to previous work studying the effects of delays on consensus algorithms, 
the models studied here allow for a node to receive multiple messages from the same neighbor in one 
iteration. The analysis of the fixed delay model shows that convergence to a consensus is guaranteed and 
the rate of convergence is reduced by no more than a factor $O(B^2)$ where $B$ is the maximum delay on
any link. For the time-varying delay model we also give a convergence proof which, for row-stochastic
consensus protocols, is not a trivial consequence of ergodic matrix products. In both delay models, the
consensus value is no longer the average, even if the original protocol was an averaging protocol. For
this reason, we propose the use of a different consensus algorithm called Push-Sum [Kempe et al. 2003].
We model delays in the Push-Sum framework and show that convergence to the average consensus is
guaranteed. This suggests that Push-Sum might be a better choice from a practical standpoint. 

I. Introduction 

This article aims to understand the effects of communication delays on discrete-time
distributed consensus algorithms. We build on two frameworks to model delay that were proposed
in [1]. For a simple model assuming fixed delays on the directed edges of a communication
network, the question of how much the consensus convergence rate deteriorates in the presence
of fixed delays was left open in [1]. Here we prove that if the maximum delay on any edge is
$B$, then the time to reach an $\epsilon$-accurate consensus in the delayed setting is no more than a factor of $O(B^2)$
larger than that in the delay-free setting. For the random delay model, we generalize
the construction presented in [1] to use an arbitrary row stochastic
consensus protocol $P$ defined without delays. Our second major contribution is a formal convergence
proof for the time-varying delay model. Finally, we show how both the fixed and random delay
models can be used with a different consensus algorithm called Push-Sum consensus [2]. For
the random delay case we show that the delay model is simplified while convergence to the true
average is still guaranteed. We conclude the paper with simulations that illustrate the effects of
delays in distributed consensus computations.

Our motivation to study communication delays comes from problems in distributed optimization
and large-scale machine learning. The dramatic increase in available data has made the
use of parallel and distributed algorithms imperative for large problems (see for example [3],
[4]). Among numerous alternatives, a significant amount of research has focused on developing
consensus-based algorithms [5]-[8] which combine some version of local optimization with
a distributed consensus protocol running over a peer-to-peer network. With such an approach, 
all computing nodes have the same role in the optimization procedure, thereby eliminating 
single points of failure and increasing robustness. This is important in large scale systems where 
machines may fail during the computation. At the same time, consensus-based algorithms are 
simple to implement and avoid the bookkeeping required by algorithms using more structured 
routing. The consensus approach is also flexible and allows for adding more computational 
resources. On the other hand, peer-to-peer networks lack a highly organized infrastructure and 
coordinating the computing nodes becomes a challenge. Much of the recent analysis of consensus 
algorithms focuses on the case where communication is over a wireless network [9].

For implementations of consensus-based optimization algorithms running on (wired) compute 
clusters, the issue of communication delays arises quite naturally. For example, in typical machine 
learning problems, the decision variable (and hence the message size) can quickly exceed many 
megabytes in size. During the time it takes to transmit such large messages, a modern processor 
can perform a significant amount of local processing of its own data, and the received information 
always appears to be delayed. In addition, cluster computing resources are typically shared among 
many users, and delays to one task are introduced if processors devote some of their cycles to 
other unrelated tasks. Finally, any network infrastructure is bound to have some fluctuation in its
performance for reasons beyond our control. It is thus important first to model communication
delays, and then incorporate those models in the analysis of consensus algorithms to understand 
what the effects of delays will be. 

A. Contributions 

In this article we study communication delays in discrete time and study their effects on 
convergence of consensus algorithms, focusing on distributed averaging. The main contributions 
of the paper are the following: 

Consensus under Fixed Delays - The effect of delay on convergence rate: Previous work
[1] introduced a fixed delay model where transmissions over each directed link of a network
experience some fixed amount of delay that does not exceed $B$. Starting with a doubly stochastic
consensus protocol $P$ it was shown that consensus is still achieved in the presence of fixed
delays at an exponential rate which depends on the second largest eigenvalue of $\hat{P}$, the modified
consensus algorithm accounting for delays. In this paper we use geometric arguments to show
that the rate of convergence does not get worse by more than a factor of $O(B^2)$.

Random delay consensus under general row stochastic protocols: Given a strongly connected 
graph $G$, in [1] a construction is given for building a matrix $\hat{P}$ that describes the consensus
updates on $G$ under the assumption that each message experiences a random amount of delay
that does not exceed $B$ iterations. Here, we generalize this model so that $\hat{P} = \hat{P}(P)$; i.e., $\hat{P}$ is
constructed from a given row stochastic consensus protocol $P$ defined on $G$ without delays.

Random delay consensus - Convergence proof for row stochastic protocols: If the initial protocol
$P$ on a graph $G$ without delays is row stochastic, using the proposed random delay model,
the consensus dynamics are captured by a sequence of matrices $\hat{P}(t)$ which may contain all-zero
rows. This means that although the consensus updates remain linear, convergence cannot
be established based on standard theory for stochastic matrix products. Here we give a complete
proof of convergence under this random delay model.

Delays under Push-Sum consensus: We study a different consensus algorithm called Push-Sum
consensus [2] which uses column stochastic matrices. We show that convergence properties
of Push-Sum are not affected in the presence of delays, and the aforementioned convergence 
results and bounds still apply. In particular, it is noteworthy that consensus on the average is 
guaranteed even in the presence of bounded random delays.

B. Paper Organization

The rest of the paper is organized as follows. We first summarize our notational conventions
in Section I-C. Section II reviews related work and Section III briefly reviews the consensus
problem. The fixed delay model and related results are given in Section IV. Next, Section V
describes and analyzes the random delay model. Illustrative simulation results appear in
Section VII, and the paper concludes in Section VIII with a discussion of possible extensions and
future work.



C. Notation 



We use bold to indicate vectors; e.g., $\mathbf{x}$. Time $t$ is always discrete and time dependence is
shown as $\mathbf{x}(t)$. Vectors are indexed by subscripts, i.e., $x_i(t)$ or $[\mathbf{x}(t)]_i$ when it is more clear. For
a set of indices $S$, by $\mathbf{x}_S$ we mean the entries of the vector $\mathbf{x}$ corresponding to the elements in
$S$, and to index the range of indices from $i$ to $j$ in the vector $\mathbf{x}$ we use the notation $[\mathbf{x}(t)]_{i:j}$.
Capital letters are used for matrices and we write $p_{ij}$, $P(i,j)$ or $[P]_{ij}$ for the element in row $i$
and column $j$ of matrix $P$; we also write $[P]_{i,:}$ for the $i$-th row and $[P]_{:,j}$ for the $j$-th column. A
matrix transpose is denoted by $P^T$. In many contexts we talk about a quantity such as a graph
$G$ or a matrix $P$ and the corresponding quantity in the presence of delays. We write $\hat{G}$ and $\hat{P}$
to denote versions of $G$ and $P$ under the delay model. The vector of all ones is indicated by
$\mathbf{1}$ and the vector of all zeros by $\mathbf{0}$. We use a subscript to show the dimension of the vector, as
in $\mathbf{1}_n$, when it is not clear from the context. We also use the indicator function $1[\text{event}]$ which
is equal to 1 if the event is true and zero otherwise. For a graph $G = (V, E)$, to talk about a
directed edge from node $i$ to node $j$ we may use $(i, j)$ or $i \to j$ or just a superscript $ij$.

II. Previous Work 



There is a rich literature on distributed averaging algorithms; see [9], [10] and references
therein. A lot of effort has been focused on analyzing the rate of convergence to the average
consensus [11]. The connection between consensus protocols and the convergence of Markov
chains [12] reveals that the spectral properties of the underlying network play an important role
in the convergence rate. Of practical interest are asynchronous consensus algorithms. In [13]
it is shown that using asynchronous broadcasts and forming convex combinations of incoming
information guarantees convergence to the average only in expectation. For time-varying
protocols, [14] provides necessary conditions under which convergence is achieved while [15]
characterizes the expectation and variance of the consensus value. Interestingly, in this paper we
show that convergence to the true average under the same conditions for time-varying protocols
is guaranteed when using a different type of algorithm called Push-Sum [2], [16].



The main focus of this work is the effect of communication delays on consensus algorithms.
For applications in partial differential equations, distributed control and multi-agent coordination,
[17], [18] and [19], [20] analyze continuous-time delay models where all messages incur the
same constant delay. Our motivation comes from applications in distributed optimization where
both computation and communication happen in rounds and take a significant amount of time.
For this reason we focus on discrete-time models. An early treatment of delays in discrete-time
distributed averaging algorithms can be found in [21], where it is proved that convergence is not
guaranteed if delays are unbounded. An analysis of conditions for convergence in the presence
of delays is given in [11]. Closer to our work are [22], [23] and [24] which model delays in
discrete time for consensus problems by augmenting the state space with delay nodes. However,
in [22] the value to which the consensus algorithm asymptotically converges is not characterized.
The model in [24] accumulates all the delayed information in a single delay node and does not
allow for delivery of messages out of order. The model in [23] has the same expressive power
as our random delay model, although the equation describing the consensus dynamics in [23]
does not allow for receiving multiple messages from the same sender in one iteration.

III. Distributed Averaging 

Assume each node $i \in V$ in a strongly connected network $G = (V, E)$ of $|V| = n$ nodes holds
a value $v_i$. We stack the initial values in a vector $\mathbf{x}(0) = (v_1, \dots, v_n)^T$. The general consensus
problem asks for a distributed algorithm such that the nodes of the network exchange messages
with their neighbours and update their state to reach consensus, i.e., $\mathbf{x}(t) \to c\mathbf{1}$ as $t \to \infty$. In
other words, we want the nodes to agree on a common value $c$ using only local communication.
It follows from Perron-Frobenius theory [25] that if we choose a row stochastic matrix $P$ that
respects the structure of the graph in the sense that $p_{ij} \neq 0$ if $(j, i) \in E$, consensus is achieved
by the iteration

$$\mathbf{x}(t) = P\mathbf{x}(t-1) = P^t \mathbf{x}(0). \qquad (1)$$

The reason is that $P\mathbf{1} = \mathbf{1}$ and $\mathbf{1}$ is the unique eigenvector corresponding to the eigenvalue 1
while all the other eigenvalues have magnitude less than one and their contribution vanishes if we
consider the eigendecomposition of $P^t$ as $t \to \infty$. As a result, $P^t$ converges to a rank-1 matrix
where each row is equal to the stationary distribution $\pi$ of the Markov chain associated with
$P$. In the special case where $\mathbf{1}^T P = \mathbf{1}^T$, the matrix $P$ is doubly stochastic and the consensus
value is $c = \frac{1}{n}\mathbf{1}^T\mathbf{x}(0)$; i.e., consensus is achieved on the average. Some situations may require using
a protocol which corresponds to a row stochastic update matrix $P$, e.g., because $G$ does not
admit a doubly stochastic matrix [26]. In such situations, if the stationary distribution $\pi$ of
$P$ is known in advance then consensus on the average can still be achieved by rescaling the
initial values by $(n\pi_i)^{-1}$ [27]. Reaching consensus on the average is particularly important in
distributed optimization since, if consensus is achieved on a value other than the average, an
undesired bias is introduced [28].
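As a minimal sketch of iteration (1) and of the rescaling by $(n\pi_i)^{-1}$, the following Python snippet runs the update on two illustrative 3-node protocols. The matrices, the initial values and the helper names are assumptions made for this example; they are not taken from the original analysis.

```python
def consensus_step(P, x):
    """One synchronous update x(t) = P x(t-1)."""
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in P]


def run_consensus(P, x0, iters=100):
    """Apply the update `iters` times, starting from x0."""
    x = list(x0)
    for _ in range(iters):
        x = consensus_step(P, x)
    return x


# Illustrative doubly stochastic protocol (rows and columns sum to 1).
P_ds = [[2 / 3, 1 / 3, 0.0],
        [1 / 6, 1 / 3, 1 / 2],
        [1 / 6, 1 / 3, 1 / 2]]

# Hypothetical row stochastic protocol that is not doubly stochastic;
# its stationary distribution is pi = (1/4, 1/2, 1/4).
P_rs = [[0.50, 0.50, 0.00],
        [0.25, 0.50, 0.25],
        [0.00, 0.50, 0.50]]
pi_rs = [0.25, 0.50, 0.25]

x0 = [1.0, 2.0, 6.0]          # illustrative initial values, average = 3
n = len(x0)

x_ds = run_consensus(P_ds, x0)                        # agrees on the average
x_rs = run_consensus(P_rs, x0)                        # agrees on pi^T x(0)
x0_scaled = [v / (n * p) for v, p in zip(x0, pi_rs)]  # rescale by (n pi_i)^-1
x_fix = run_consensus(P_rs, x0_scaled)                # average is recovered

print(x_ds, x_rs, x_fix)
```

With these inputs the doubly stochastic protocol agrees on the average 3, the row stochastic one agrees on the $\pi$-weighted value 2.75, and rescaling the initial values brings the row stochastic protocol back to 3.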



When the protocol $P$ is fixed, the update (1) represents a synchronous algorithm where
all nodes transmit information to their neighbours at the same time and each node receives
exactly one message from each neighbour at each iteration. If we want to model scenarios
where nodes communicate asynchronously or, as we will see below, if we want to model
random communication delays where information may arrive in a different order than it was
transmitted and we receive an unknown number of messages from each neighbour, we must
consider time-varying protocols $P(t)$. The situation now becomes more involved as we may
not be able to specify the stationary distribution to which the algorithm converges beyond its
mean and variance [15]. Furthermore, if we restrict to protocols where each node only transmits
information without expecting a response (i.e., one-directional communication), using time-varying
doubly stochastic protocols becomes impossible without extra coordination, while row
stochastic protocols only converge to the average in expectation [13]. For these reasons, in the
following we also consider a different type of consensus algorithm called Push-Sum consensus
which does not have these limitations in the time-varying case.

IV. Fixed Communication Delays 

We first analyze a model where the delay over each communication link does not vary with
time. This is generally not true in practice but a fixed delay model can be appropriate in an
average sense when the true delay does not fluctuate too much. A question left open in [1] for this
model is how the convergence rate of consensus with fixed delays depends on the maximum
delay $B$. After reviewing the fixed delay model, we provide an answer below.

Note that for the rest of this section, whenever we talk about a quantity $Q$, such as a graph
or a matrix, we use a hat (i.e., $\hat{Q}$) for the transformed version of $Q$ in the presence of delays.

A. Fixed Delay Model 

Assume that in a given network $G$, for a directed link $(i, j)$ every message from $i$ to $j$ is
delayed by $b_{ij}$ time units. We model this delay by replacing the link $(i, j)$ with a chain of $b_{ij}$
virtual delay nodes in the network, acting as relays between $i$ and $j$. This leads to a network $\hat{G}$
which contains the original compute nodes, $V$, as well as $b = \sum_{(i,j) \in E} b_{ij}$ delay nodes. Our goal
is to study the corresponding consensus protocol running over $\hat{G}$. We assume that a consensus
protocol $P$ in the delay-free network $G$ is given, so in the presence of delays the compute nodes
still transmit and combine incoming messages using the weights provided by $P$. In [1], we
describe how to construct a stochastic matrix $\hat{P}$ in the augmented space of $n + b$ nodes starting
from a delay-free consensus protocol $P$. The matrix $\hat{P}$ encodes communication of information
between delay and compute nodes and has a stationary distribution $\hat{\pi}$ which is not uniform and
depends on both $P$ and the edge delays. We clarify that the augmentation of $G$ with delay nodes
is done just for the purpose of modelling and analysis; no physical delay nodes are actually
added to the network.

To illustrate the construction of $\hat{P}$ from $P$, consider a graph $G$ with 3 nodes. Suppose that
the delay-free consensus protocol is specified by the matrix

$$P = \begin{bmatrix} 2/3 & 1/3 & 0 \\ 1/6 & 1/3 & 1/2 \\ 1/6 & 1/3 & 1/2 \end{bmatrix}. \qquad (2)$$

To model a fixed delay of 2 whenever node 1 transmits to node 2, we augment $G$ with two
delay nodes $d_1^{1 \to 2}$, $d_2^{1 \to 2}$ so that information from 1 to 2 must pass through them first. In the
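augmented graph $\hat{G}$, the consensus protocol is described by a row stochastic matrix $\hat{P}$. Using
the weights of $P$ in (2), with rows and columns ordered as $(1, 2, 3, d_1^{1 \to 2}, d_2^{1 \to 2})$, we have

$$\hat{P} = \begin{bmatrix} 2/3 & 1/3 & 0 & 0 & 0 \\ 0 & 1/3 & 1/2 & 0 & 1/6 \\ 1/6 & 1/3 & 1/2 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}. \qquad (3)$$

Each receiving node forms a convex combination of the incoming messages so in $\hat{P}$, node 2
receives information from node $d_2^{1 \to 2}$ with weight $\frac{1}{6}$ because $p_{2,1} = \frac{1}{6}$.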
Using $\hat{P}$ we can analyze the effect of delays on convergence based on the update equations
for row stochastic consensus

$$\mathbf{x}(t) = \hat{P}\mathbf{x}(t-1), \qquad (4)$$

where $\mathbf{x}(t)$ is the augmented state vector of dimension $n + b$ containing values for the compute
nodes and virtual delay nodes. If $P$ is doubly stochastic, our previous work [1] provides an
exact characterization of $\hat{\pi}$, the stationary distribution of $\hat{P}$. Let us index the directed edges of
$G$ (without delays) by $r = 1, 2, \dots, m$. We use the notation $(i(r), j(r))$ to specify that edge
$r$ starts at node $i$ and is directed to node $j$. Moreover, let $b_r$ denote the amount of delay on
edge $r$, and with a slight abuse of notation, let $\hat{\pi}_r$ denote the value of the stationary distribution
vector for all delay nodes in the chain replacing edge $r$. The stationary distribution of $\hat{P}$ has the
structure

$$\hat{\pi} = \begin{bmatrix} \hat{\pi}_V \mathbf{1}_n^T & \hat{\pi}_1 \mathbf{1}_{b_1}^T & \cdots & \hat{\pi}_m \mathbf{1}_{b_m}^T \end{bmatrix}^T, \qquad (5)$$

and the exact values are

$$\hat{\pi}_V = \frac{1}{n + \sum_r b_r p_{i(r)j(r)}}, \qquad \hat{\pi}_r = \frac{p_{i(r)j(r)}}{n + \sum_r b_r p_{i(r)j(r)}}. \qquad (6)$$

In the special case where $P$ is a max-weight doubly stochastic matrix (footnote 1), the entries of $\hat{\pi}$ only
take one of two values, one for the compute nodes in the set $V$ and one for the delay nodes, i.e.,
it does not matter how the delays are distributed over the links. Specifically, denoting by $C$ the
set of delay nodes we have

$$\hat{\pi}_V = \frac{d_{\max} + 1}{b + n(d_{\max} + 1)}, \qquad \hat{\pi}_C = \frac{1}{b + n(d_{\max} + 1)}, \qquad (7)$$

where $d_{\max}$ is the maximum degree of $G$ viewed as undirected ignoring self-loops.

1 For an undirected graph $G$ without self loops, with adjacency matrix $A$ and node degrees $v = [\deg_1, \dots, \deg_n]$, the max-weight
matrix is defined as $P = I - \frac{\mathrm{diag}(v) - A}{d_{\max} + 1}$, which is doubly stochastic.

Notice that even when $P$ is doubly stochastic (and thus admits average consensus), the row
stochastic delayed protocol $\hat{P}$ does not converge to the average in general, since its stationary
distribution is not uniform. To converge to the average with $\hat{P}$ we need to rescale the initial
values as explained in Section III, using the stationary distribution of $\hat{P}$.
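The loss of the average can be checked numerically on the 3-node example with a delay of 2 on the link from node 1 to node 2. The sketch below makes several assumptions that are not in the original text: the augmented matrix is written with compute nodes first and the two delay nodes last, the virtual delay nodes start at 0, and the compute values are rescaled by $(n\hat{\pi}_V)^{-1}$.

```python
from fractions import Fraction as F

# Augmented protocol for the example, nodes ordered (1, 2, 3, d1, d2).
P_hat = [[F(2, 3), F(1, 3), F(0),    F(0), F(0)],
         [F(0),    F(1, 3), F(1, 2), F(0), F(1, 6)],
         [F(1, 6), F(1, 3), F(1, 2), F(0), F(0)],
         [F(1),    F(0),    F(0),    F(0), F(0)],
         [F(0),    F(0),    F(0),    F(1), F(0)]]

n, b_r, p_r = 3, 2, F(1, 6)          # compute nodes, chain length, edge weight
c = n + b_r * p_r                    # normalizer of the closed-form distribution
pi_hat = [1 / c] * n + [p_r / c] * b_r
size = n + b_r

# Exact stationarity check: pi_hat^T P_hat == pi_hat^T.
left = [sum(pi_hat[i] * P_hat[i][j] for i in range(size)) for j in range(size)]
is_stationary = (left == pi_hat)


def limit_value(compute_values):
    """Consensus value pi_hat^T x(0), with delay nodes initialised to 0."""
    x0 = list(compute_values) + [F(0)] * b_r
    return sum(p * v for p, v in zip(pi_hat, x0))


x0 = [F(1), F(2), F(6)]              # illustrative initial values
average = sum(x0) / n
plain = limit_value(x0)                                     # biased value
rescaled = limit_value([v / (n * pi_hat[0]) for v in x0])   # average again

# Floating-point iteration of the update, as a sanity check on the limit.
x = [float(v) for v in x0] + [0.0] * b_r
for _ in range(200):
    x = [sum(float(p) * xj for p, xj in zip(row, x)) for row in P_hat]

print(is_stationary, average, plain, rescaled, x)
```

Here the closed-form distribution assigns 3/10 to each compute node and 1/20 to each delay node, so the unscaled iteration settles on 27/10 instead of the average 3, while the rescaled initial values give exactly 3.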

By construction, the delay nodes only relay information and have no self loops. Thus, the
diagonal entries in $\hat{P}$ corresponding to delay nodes are zero. This makes $\hat{P}$ a non-reversible
Markov chain that is not strongly aperiodic (footnote 2), and the majority of known convergence rate results
for Markov chains do not apply. To get a bound on the convergence rate under fixed delays,
we apply the result from [29] with the lazy version $\hat{P}_{lazy} = \frac{1}{2}(I + \hat{P})$ of $\hat{P}$. First, the additive
reversibilization of a Markov chain with transition matrix $\hat{P}$ is defined by

$$U(\hat{P}) = \frac{\hat{P} + \tilde{P}}{2}, \qquad (8)$$

where $\tilde{P}$ is the time-reversed chain. Next, since $\hat{P}_{lazy}$ is non-reversible but strongly aperiodic
and converges no more than two times slower than $\hat{P}$, applying Fill's result [29] we have



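$$\left\| [\hat{P}^t]_{i,:} - \hat{\pi} \right\|_{TV}^2 \le \left\| [\hat{P}_{lazy}^t]_{i,:} - \hat{\pi}_{lazy} \right\|_{TV}^2 \le \frac{\lambda_2\left(U(\hat{P}_{lazy})\right)^t}{[\hat{\pi}_{lazy}]_i} \qquad (9)$$

with $\hat{\pi}_{lazy} = \hat{\pi}$.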
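The lazy chain and its additive reversibilization can be computed explicitly for the 3-node example. The sketch below assumes the standard definitions, namely a time reversal with entries $\pi_j P_{ji} / \pi_i$ and an additive reversibilization equal to the average of a chain and its time reversal. The helper names and the node ordering (compute nodes first, then the two delay nodes) are illustrative choices.

```python
from fractions import Fraction as F


def lazy(P):
    """Lazy version (I + P) / 2."""
    n = len(P)
    return [[(P[i][j] + (1 if i == j else 0)) / 2 for j in range(n)]
            for i in range(n)]


def time_reversal(P, pi):
    """Time-reversed chain with entries pi_j * P[j][i] / pi_i."""
    n = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]


def additive_reversibilization(P, pi):
    """Average of a chain and its time reversal."""
    R = time_reversal(P, pi)
    n = len(P)
    return [[(P[i][j] + R[i][j]) / 2 for j in range(n)] for i in range(n)]


def is_reversible(P, pi):
    """Detailed balance: pi_i P[i][j] == pi_j P[j][i] for all i, j."""
    n = len(P)
    return all(pi[i] * P[i][j] == pi[j] * P[j][i]
               for i in range(n) for j in range(n))


# Augmented protocol of the 3-node example and its stationary distribution.
P_hat = [[F(2, 3), F(1, 3), F(0),    F(0), F(0)],
         [F(0),    F(1, 3), F(1, 2), F(0), F(1, 6)],
         [F(1, 6), F(1, 3), F(1, 2), F(0), F(0)],
         [F(1),    F(0),    F(0),    F(0), F(0)],
         [F(0),    F(0),    F(0),    F(1), F(0)]]
pi_hat = [F(3, 10)] * 3 + [F(1, 20)] * 2

U = additive_reversibilization(lazy(P_hat), pi_hat)

original_reversible = is_reversible(P_hat, pi_hat)
U_reversible = is_reversible(U, pi_hat)
U_strongly_aperiodic = all(U[i][i] >= F(1, 2) for i in range(len(U)))
U_row_stochastic = all(sum(row) == 1 for row in U)

print(original_reversible, U_reversible, U_strongly_aperiodic, U_row_stochastic)
```

For this example the delayed protocol itself fails detailed balance, whereas the reversibilized lazy chain satisfies it with the same stationary distribution and has all diagonal entries at least 1/2.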

Our initial work [1] left open the question of to what extent delays affect the convergence
rate of average consensus protocols. One way to address this is to understand how much larger
$\lambda_2(U(\hat{P}_{lazy}))$ is in comparison to $\lambda_2(P)$. We provide an answer next.



2 A Markov chain is strongly aperiodic if all the diagonal entries of its transition matrix are at least 1/2.

B. Effect of Delays on Second Eigenvalue

The convergence rate of a consensus protocol $P$ to stationarity in terms of total variation
distance can be bounded by $\lambda_2(P)$, the second largest eigenvalue of $P$. The second eigenvalue
in turn can be bounded using a geometric argument based on the Poincaré inequality [29],
[30]. The intuition is to look for the bottleneck edge which limits the flow of information and
consequently the convergence speed. Assume the stationary distribution of $P$ is $\pi$. For each pair
of nodes $\{x, y\}$ of $G$, we choose a (directed) path $\gamma_{xy}$ from $x$ to $y$. To identify bottlenecks we
look at how many paths $\gamma_{xy}$ go through the same edge. A measure of bottlenecks in $G$ is given
by the Poincaré constant,

$$K = \max_{e=(v,w)} \frac{1}{\pi_v p_{vw}} \sum_{x,y \text{ s.t. } e \in \gamma_{xy}} |\gamma_{xy}| \pi_x \pi_y, \qquad (10)$$

where $|\gamma_{xy}|$ is the length (in number of edges) of the path $\gamma_{xy}$. The constant $K$ quantifies the
load on the most heavily used edge. Less formally, that involves identifying an edge through
which many and long paths must pass for pairs of nodes to communicate over $G$. In addition,
the paths are assigned an importance based on the stationary distribution value at the endpoints.
Depending on the quality of the paths, we get a more accurate characterization of bottlenecks.
Given a set of paths $\Gamma = \{\gamma_{xy}\}$, the Poincaré constant gives a bound on the second eigenvalue
of $P$:

$$\lambda_2 \le 1 - \frac{1}{K}. \qquad (11)$$
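To make the Poincaré constant concrete, the following sketch evaluates it for a small reversible chain on a 3-node path graph. Everything in the example is an assumption made for illustration: the chain, the choice of one path per ordered pair of nodes, the normalization of each edge load by $\pi_v P_{vw}$ (the usual form of the constant for reversible chains), and the helper name `poincare_constant`.

```python
from fractions import Fraction as F


def poincare_constant(P, pi, paths):
    """Maximum over directed edges of the normalized path load.

    paths maps an ordered pair (x, y) to the node sequence [x, ..., y].
    """
    load = {}
    for (x, y), nodes in paths.items():
        length = len(nodes) - 1
        for edge in zip(nodes, nodes[1:]):
            load[edge] = load.get(edge, 0) + length * pi[x] * pi[y]
    return max(total / (pi[v] * P[v][w]) for (v, w), total in load.items())


# Hypothetical reversible chain on the path graph 0 - 1 - 2.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0),    F(1, 2), F(1, 2)]]
pi = [F(1, 4), F(1, 2), F(1, 4)]

# One shortest path for every ordered pair of distinct nodes.
paths = {(0, 1): [0, 1], (1, 0): [1, 0],
         (1, 2): [1, 2], (2, 1): [2, 1],
         (0, 2): [0, 1, 2], (2, 0): [2, 1, 0]}

K = poincare_constant(P, pi, paths)
bound = 1 - 1 / K                     # upper bound on the second eigenvalue

# (1, 0, -1) is an eigenvector of P with eigenvalue 1/2.
v = [1, 0, -1]
Pv = [sum(P[i][j] * v[j] for j in range(3)) for i in range(3)]

print(K, bound, Pv)
```

For this chain the eigenvalues are 1, 1/2 and 0, and the computed constant is $K = 2$, so the bound $1 - 1/K = 1/2$ is attained by the second eigenvalue.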

Our goal is to use a given set of canonical paths $\Gamma$ for $G$ to construct a set of canonical paths in
$\hat{G}$, the augmentation of $G$ after adding fixed edge delays. This will reveal how the delays affect
the convergence rate we have for $P$. To that end, we compute the Poincaré constant for $\hat{G}$ as a
function of the Poincaré constant of the original graph $G$.

Since $\hat{P}$ represents a non-reversible Markov chain, we consider the lazy additive reversibilization
$U(\hat{P}_{lazy})$ which is strongly aperiodic, reversible, has the same stationary distribution as
$\hat{P}$, and whose convergence rate bounds that of $\hat{P}$. With the exception of some added self loops
on the delay nodes, the graph structure compatible with $U(\hat{P}_{lazy})$ is the same as that of $\hat{P}$. To
compute the Poincaré constant $\hat{K}$ for $\hat{G}$ we start with some observations and consequences of
augmenting $G$ with fixed delays. We assume that the maximum delay on any edge is $B$ and we
use subscripts to index the nodes on a delay chain.

Fig. 1. (Top) A path $\gamma_{xy}$ in $G$. (Bottom) After adding delays in $\hat{G}$, all paths from nodes $\{x^-, x, x^+\}$ towards nodes $\{y^-, y, y^+\}$
are associated with the same path $\gamma_{xy}$. If $e = (v, w)$ was a bottleneck edge in $G$, edge $\hat{e}$ in the middle of the delay chain that
replaced $e$ will be a bottleneck edge in $\hat{G}$.



1. We claim that if $e = (v, w)$ is the bottleneck edge in $G$ with no delays, all edges on the
delay chain $v \to d_1 \to \cdots \to d_{B'} \to w$, $B' \le B$, that replaces $e$ in $\hat{G}$ are bottlenecks in $\hat{G}$. The
reason is that if a flow needs to go through $e$ in $G$, it will have to go through all of the delay
edges replacing $e$ in $\hat{G}$. This is true because the degrees of the compute nodes do not change
by adding fixed delays the way we described above, and the paths between the compute nodes
are just elongated without offering new path alternatives. As a result, to compute the Poincaré
constant of $U(\hat{P}_{lazy})$ we do not need to maximize over all edges in $\hat{G}$. Instead we only examine
edges in the middle of delay chains. That is, if a delay chain connecting compute nodes $a$ and
$b$ has length $B'$, we only consider the edge $\hat{e}$ between the two delay nodes $d^{ab}$ in the middle of that chain.

2. We intend to use the given collection of canonical paths $\Gamma$ on $G$ to derive a bound on
the Poincaré constant of $\hat{G}$. The graph with delays has more nodes and thus more paths to be
considered. However, we can associate a collection of paths of $\hat{G}$ with the same path in $G$ using
the compute nodes as identifiers for each path. The key point is to ensure that if a path $\gamma_{xy}$
goes through an edge $e$ of $G$, then in $\hat{G}$ we have a set of paths $\{\hat{\gamma}_{xy}\}$ identified by the same
compute nodes $x \to y$. All those paths go through $\hat{e}$, the edge in the middle of the delay chain
that replaced $e$ in $\hat{G}$. By forming this path association, the expression for $K$ will appear in the
bound for $\hat{K}$. Figure 1 illustrates the path association.

We distinguish the following nine cases. If $x, y$ are compute nodes in $\hat{G}$, we associate
$\hat{\gamma}_{xy} \sim \gamma_{xy}$. Note that $|\hat{\gamma}_{xy}| \le (B + 1)|\gamma_{xy}|$ when the maximum possible delay per edge is $B$. Next,
to consider paths to or from delay nodes, we associate a delay node with the compute node
that is closest to it in the direction of the path. Let us use the notation $x^-$ to denote delay
nodes before $x$ associated with paths through $x$, and $x^+$ to denote delay nodes after $x$. For
each path $\gamma_{xy}$ of $G$ going through edge $e$, we identify different cases of paths in $\hat{G}$ going
through $\hat{e}$ (the middle edge in the delay chain that replaces $e$). We have eight further possibilities:
$x \to y^-$, $x \to y^+$, $x^- \to y^-$, $x^- \to y$, $x^- \to y^+$, $x^+ \to y^-$, $x^+ \to y$, and $x^+ \to y^+$.

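3. To get a cleaner expression for the bound, assume that $P$ is doubly stochastic. In that case,
from (6) we see that the stationary distribution of the compute nodes in the presence of delays
is $\hat{\pi}_x = \frac{1}{c}$ where $c = n + \sum_r b_r p_{i(r)j(r)}$. Moreover, since by (6) each delay node carries the mass of a
compute node scaled by an edge weight, for all compute nodes $x$ we have $\hat{\pi}_{x^-} \le p\hat{\pi}_x$
and $\hat{\pi}_{x^+} \le p\hat{\pi}_x$ where $p = \max_{i,j} p_{ij}$.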

With the above considerations in mind, we start from the definition of the Poincaré constant
for $\hat{G}$:

$$\hat{K} = \max_{h=(a,b)} \frac{1}{\hat{\pi}_a \hat{p}_{ab}} \sum_{x,y \text{ s.t. } h \in \hat{\gamma}_{xy}} |\hat{\gamma}_{xy}| \hat{\pi}_x \hat{\pi}_y, \qquad (12)$$

where $\hat{p}_{ab}$ is the transition probability of $U(\hat{P}_{lazy})$ on edge $h$.
Let $e = (v, w)$ be a bottleneck edge of $G$. This means that the edge $\hat{e}$ in the middle of the delay
chain that replaces $e$ will be the bottleneck in $\hat{G}$. After some algebra we can bound $\hat{K}$ with an
expression that involves $K$ (from (10)). Besides the leading constant involving the bottleneck
edge, we need to break the sum over the canonical paths into summands according to the nine
cases we described in consideration 2 above. We refer the reader to the appendix for a proof
and we state here the final result.

Theorem 1: Let $G$ be a network endowed with a doubly stochastic consensus protocol $P$ and
a set of canonical paths $\Gamma$ yielding a Poincaré constant $K$. Then adding fixed delays up to $B$
on the edges of $G$ yields a Poincaré constant $\hat{K}$ for the delay graph $\hat{G}$ for which $\hat{K} \le ZK$, where
$Z$ is the product of a leading constant involving the bottleneck edge $(v, w)$ and $c$ with the polynomial

$$p^2(2d_{\max}^2 + 3d_{\max} + 1)B^3 + p(2pd_{\max}^2 + 2pd_{\max} + 8d_{\max} + 6)B^2 + (8pd_{\max} + p + 8)B + 8, \qquad (13)$$

where $(v, w)$ is a bottleneck edge in $G$, $p = \max_{i,j} p_{ij}$, $c = n + \sum_r b_r p_{i(r)j(r)}$ and $d_{\max}$ is the
maximum degree in the undirected graph $G$ ignoring self-loops.

Theorem 1 yields a bound on the second eigenvalue and thus the spectral gap of $\hat{P}$.

Corollary 1: Suppose a doubly stochastic protocol $P$ on a graph $G$ has a spectral gap
$1 - \lambda_2(P) \ge \frac{1}{K}$ and assume that messages over the edges of $G$ experience arbitrary fixed delays of
up to $B$ iterations. Then the spectral gap of $\hat{P}$ is reduced by at most a factor $\Theta(B^2)$; i.e.,

$$1 - \lambda_2(\hat{P}) \ge \frac{1}{ZK}, \qquad Z = \Theta(B^2). \qquad (14)$$

Proof: From Theorem 1 we have $\lambda_2(\hat{P}) \le \lambda_2(U(\hat{P}_{lazy})) \le 1 - \frac{1}{ZK}$. Since $b_r \le B$, $r = 1, 2, \dots, m$,
we see that $c = n + \sum_r b_r p_{i(r)j(r)} = \Theta(B)$ and thus $Z = \Theta(B^2)$. ■
To the best of our knowledge this is the first result to describe the effect of a bounded fixed 
delay on the convergence rate of average consensus. It shows that the delays cannot slow down 
consensus by more than a polynomial factor and convergence remains exponentially fast. 

V. Time Varying Communication Delays 

To capture real network volatility, it is more appropriate to assume that link delays vary 
randomly with time. In [1], a discrete-time random delay model is presented. However, the
construction only applies to uniform consensus weights (i.e., where $P$ is the natural random
walk on $G$), and convergence to consensus is only verified in simulation. Here, we generalize
the construction of the model from [1] to use any row-stochastic protocol and we present a
formal convergence proof. 



A. Random Delay Model 

Similar to the fixed delay model, we add virtual delay nodes. We assume again that delays are
finite and upper bounded by a maximum delay $B$. As emphasized in [1], with random delays
in discrete time we need to be careful. Others have previously analyzed a consensus update of
the form

$$x_i(t + 1) = \sum_{j=1}^{n} p_{ij} x_j(t - b_{ij}(t)), \qquad (15)$$

where $b_{ij}(t)$ is the random delay experienced by link $(j, i)$ at time $t$ [19], [23]. However, this
type of update implies that at time $t$ each node $i$ will only receive a single (possibly delayed)
message from each neighbour $j$. In practice this may not be true. For example, take an edge $(i, j)$
whose delay could be 1 or 2. Assume at iteration $t$ node $i$ sends a message $m_t$ to $j$ and
at time $t + 1$, $i$ sends a new message $m_{t+1}$ to $j$. If $m_t$ is delayed by 2 time units and $m_{t+1}$ is
delayed by 1 unit, then both $m_t$ and $m_{t+1}$ will be delivered to node $j$ at time $t + 2$. This scenario
can easily occur in practice when messages are large in size and receiving a message takes a
non-trivial amount of time during which a second message can arrive. When this happens, the
receiving node polling its buffer experiences the arrival of two messages during the same time
slot.

Fig. 2. Adding a random bounded delay on edge (1,2). At this particular instant, 1 sends with delay 2 since the connections
to delays 1 and 3 are deactivated.
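The two-message scenario can be reproduced with a few lines of Python. The sketch only tracks send times and delays on a single link; the function names, the seed and the delay set {1, 2} are illustrative assumptions.

```python
from collections import defaultdict
import random


def delivery_slots(messages):
    """Group messages by arrival slot.

    messages is an iterable of (send_time, delay) pairs; the result maps
    each arrival time to the list of send times delivered in that slot.
    """
    slots = defaultdict(list)
    for send_time, delay in messages:
        slots[send_time + delay].append(send_time)
    return dict(slots)


# The example from the text with t = 0: m_t is delayed by 2, m_{t+1} by 1.
example = delivery_slots([(0, 2), (1, 1)])


def count_multi_arrival_slots(horizon, delays=(1, 2), seed=0):
    """Count slots in which more than one message is delivered when one
    message is sent per iteration with a uniformly sampled delay."""
    rng = random.Random(seed)
    slots = delivery_slots((t, rng.choice(delays)) for t in range(horizon))
    return sum(1 for senders in slots.values() if len(senders) > 1)


print(example)
print(count_multi_arrival_slots(50))
```

Both messages of the example land in slot 2, which an update of the form (15) cannot represent. When the only possible delay is 1, no slot ever receives two messages.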

To model random bounded delays, we replace each directed edge of the original graph with
multiple delay chains of varying lengths to model varying amounts of delay. Every time a
message is sent, a random decision is made for which delay chain the message will take to
reach its destination (footnote 3). If a communication network with $n$ computing nodes has $m$ directed
edges (not counting the self loops), each edge delivers messages with some bounded delay that
is randomly chosen between 0 and $B$. For example, for an edge $(1, 2)$ with a maximum delay
of 3 we replace the edge in $G$ with three parallel delay chains $(d_1^1)$, $(d_1^2, d_2^2)$, $(d_1^3, d_2^3, d_3^3)$ in $\hat{G}$; see
Figure 2. We avoid indexing the delay nodes by edge number to not clutter notation. We augment
the graph with $\frac{B(B+1)}{2}$ delay nodes per edge or $b = \frac{mB(B+1)}{2}$ delay nodes total, where $m$ is the
number of edges in $G$. We also allow for messages to be delivered without delay, by including
the directed edges of the original graph $G$.
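The bookkeeping for the parallel delay chains can be sketched as follows. The labels `(chain, position)` for the delay nodes and the function names are hypothetical; the sketch only illustrates the node count and the random choice of a route.

```python
import random


def delay_chains(B):
    """Parallel chains of lengths 1..B that replace one directed edge.

    A delay node is labelled (chain length, position in the chain).
    """
    return [[(k, pos) for pos in range(1, k + 1)] for k in range(1, B + 1)]


def total_delay_nodes(m, B):
    """Number of delay nodes for m directed edges: m * B * (B + 1) / 2."""
    return m * B * (B + 1) // 2


def sample_route(B, rng, weights=None):
    """Pick the delay of one message: 0 is the direct edge, k > 0 selects
    the chain of length k. With weights=None the choice is uniform."""
    return rng.choices(range(B + 1), weights=weights, k=1)[0]


chains = delay_chains(3)
print(chains)
print(total_delay_nodes(1, 3), total_delay_nodes(4, 3))
print(sample_route(3, random.Random(0)))
```

For a maximum delay of 3 the chains hold 1, 2 and 3 delay nodes, which gives 6 delay nodes per edge. The optional `weights` argument stands in for any discrete distribution on the delays 0 to $B$.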

Our goal is to write a matrix $\hat{P}(t)$ that will describe the consensus dynamics under random
delays using linear updates. Our previous work [1] presented a model for the simple case where
all incoming messages receive equal weight (inversely proportional to the number of neighbors). To address
the general case, we assume here that we are given a row stochastic protocol $P$ for the graph
$G$, and we construct $\hat{P}(t)$ using the weights suggested by $P$.

3 Of course in reality this random choice is made by the environment, i.e., the network, and is beyond our control. For modeling 
purposes to emulate and understand the effect of delays, we can draw a random sample from a distribution that we believe 
resembles how real network conditions fluctuate. 
Every time a message is sent, it is routed randomly through one of the $B$ delay chains or the
direct edge with zero delay. Outgoing edges to the other chains leading to the same recipient are
cut off. Here we consider a time-varying delay model where the delay of each message is
independent of the delays of other messages, on other edges and at other time instants, and
identically distributed. For more accurate modelling, we can impose any discrete probability
distribution on the integers $0, \dots, B$ to control the expected delay of an edge. This does not
affect the convergence analysis presented below.
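
As an illustration of imposing a discrete distribution on the delays, the sketch below draws i.i.d. delays from a probability mass function on $\{0, \dots, B\}$. The function names and the particular distribution are assumptions chosen for the example, not values taken from the paper.

```python
import random


def sample_delay(pmf, rng):
    """Draw one delay from {0, ..., B}, where pmf[d] is the probability of delay d."""
    return rng.choices(range(len(pmf)), weights=pmf, k=1)[0]


def expected_delay(pmf):
    """Mean delay of an edge under the given distribution."""
    return sum(d * p for d, p in enumerate(pmf))


rng = random.Random(0)            # fixed seed so the run is reproducible
pmf = [0.5, 0.25, 0.125, 0.125]   # hypothetical distribution for B = 3
samples = [sample_delay(pmf, rng) for _ in range(1000)]
```

Skewing the distribution towards small delays lowers the expected delay of an edge without changing the bound $B$ that the analysis depends on.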

As we see, the augmented graph topology changes at every iteration based on which outgoing 
edges to delay chains are active. To describe the consensus update equations we need to model 
the changing topology. At each iteration, a delay is sampled for each message to be transmitted. 
Based on these delays, at iteration $t$ the graph adjacency matrix $A(t)$ is a sample from the set
$\{A_1, \dots, A_{(B+1)^m}\}$ of possible adjacency matrices. Notice that a delay node could either contain
a message or be empty, and a zero message is not the same as the node being empty. To keep track
of which delay nodes are empty we define an indicator vector sequence $\phi(t) \in \{0, 1\}^b$.

Using $A(t)$ and $\phi(t)$ we show how to write a transition matrix $P(t)$ at each iteration $t$.

We begin by noticing that the adjacency matrices $A(t)$ have the structure

$$A(t) = \begin{bmatrix} I_{n \times n} + L(t) & J_{n \times b} \\ R(t) & C_{b \times b} \end{bmatrix}. \tag{16}$$

Matrix $A(t)$ should be interpreted as a directed graph adjacency matrix. Element $[A(t)]_{ij}$ is 1 if
there is a directed link from $j$ to $i$. Its constituent parts $L(t)$, $J_{n \times b}$, $R(t)$, and $C_{b \times b}$ are described
next.

The upper left block is an identity matrix to represent the self-loops plus a random $n \times n$
square matrix $L(t)$ with zeros on the diagonal and a one at position $(i, j)$ if compute node $j$
sends a message to compute node $i$ with zero delay$^4$ at iteration $t$. Matrix $R(t)$ is $b \times n$ and is
also a random matrix. Whenever a compute node $i$ transmits to another compute node $j$ using
delay chain $r = 1, \dots, B$, matrix $R(t)$ will encode that random delay choice for time $t$. For
example, if at time $t$ node $j$ sends a message to $i$ which is delayed by 2 steps (so that it will
arrive at time $t + 3$), $R(t)$ will contain a block for edge $(j, i)$ indicating the delay chain that is
active, as illustrated in equation (17).

4 Note that zero delay means that a message sent at iteration t will be delivered at iteration t + 1, i.e., without any delay.

[Equation (17): the block of $R(t)$ for edge $(j, i)$, with rows indexed by the delay nodes of that edge and columns indexed by the compute nodes; its only non-zero entry is a 1 in row $d_1^2$, column $j$.]

Element $(d_1^2, j)$ of $R(t)$ is 1 since $j$ will transmit to the first delay node in the chain of length
2 towards $i$. The entries that are not shown within each block are all zero.

Matrix $J_{n \times b}$ describes the connections between the delay nodes $d_r^r$ at the end of each delay
chain delivering messages to the compute nodes. The part of $J_{n \times b}$ corresponding to the edge
$(j, i)$ of $R(t)$ just discussed is shown in equation (18).

[Equation (18): the block of $J_{n \times b}$ for edge $(j, i)$, with columns indexed by the delay nodes of that edge; its only non-zero entries are 1s in the columns of the last node of each chain.]

I.e., for edge $j \to i$, the entries of $J_{n \times b}$ in the row of the receiving compute node and the columns
$d_1^1$, $d_2^2$ and $d_3^3$ are all 1. Finally, we define
the matrix $C_{b \times b}$ for forwarding messages from one delay node to the next on each chain. On a
specific delay chain of length $h$, messages are forwarded through the action of an $h \times h$ Toeplitz
forward shift matrix with 1s on the first lower diagonal, i.e.,

$$S_h = \begin{bmatrix} 0 & \cdots & & 0 \\ 1 & 0 & & \vdots \\ & \ddots & \ddots & \\ 0 & \cdots & 1 & 0 \end{bmatrix}. \tag{19}$$

For any edge $r = 1, \dots, m$, to forward messages through all delay chains we use a block diagonal
matrix $K_r = \mathrm{diag}(S_1, S_2, \dots, S_B)$. Finally, since we have $m$ edges,

$$C_{b \times b} = \mathrm{diag}(K_1, K_2, \dots, K_m). \tag{20}$$



Looking back at (16), observe that every row of $[R(t) \;\; C_{b \times b}]$ contains at most one non-zero
element and there are rows that are all zero.

Next, we define an indicator vector $\phi(t) \in \{0, 1\}^b$ that keeps track of whether a delay node
on any delay chain contains a message or is empty. Initially we have $\phi(0) = 0_b$. At iteration $t$,
the first nodes in the delay chains may receive new information depending on which edges are
activated by $R(t)$. The rest of the delay nodes will be non-empty only if their predecessors in
the chains were non-empty in the previous iteration. In other words, $\phi(t)$ evolves as

$$\phi(t) = R(t) 1_n + C_{b \times b}\, \phi(t - 1). \tag{21}$$
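
To make the occupancy recursion concrete, the following sketch tracks $\phi(t)$ for a single edge with $B = 3$. It rests on assumptions that are not in the original text: the six delay nodes are ordered $d_1^1, d_1^2, d_2^2, d_1^3, d_2^3, d_3^3$, there is one sender, and the term $R(t) 1_n$ is represented directly by marking the first node of the chosen chain. All function names are hypothetical.

```python
def shift_matrix(h):
    """h x h forward shift matrix S_h with ones on the first lower diagonal."""
    return [[1 if r == c + 1 else 0 for c in range(h)] for r in range(h)]


def block_diag(blocks):
    """Assemble square blocks into one block diagonal matrix."""
    size = sum(len(b) for b in blocks)
    out = [[0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, val in enumerate(row):
                out[offset + r][offset + c] = val
        offset += len(b)
    return out


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


B = 3
C = block_diag([shift_matrix(h) for h in range(1, B + 1)])  # K_r for one edge
first_node = {1: 0, 2: 1, 3: 3}  # index of d_1^r in the assumed ordering


def step(phi_prev, chosen_delay):
    """One update phi(t) = R(t) 1_n + C phi(t - 1) for a single sender."""
    entering = [0] * len(phi_prev)
    if chosen_delay in first_node:  # delay 0 uses the direct edge, no chain
        entering[first_node[chosen_delay]] = 1
    shifted = matvec(C, phi_prev)
    return [a + b for a, b in zip(entering, shifted)]


phi = [0] * 6
history = []
for delay in [2, 0, 3, 1]:  # delays chosen at iterations 1, 2, 3, 4
    phi = step(phi, delay)
    history.append(phi)
```

A message that enters the chain of length 2 occupies $d_1^2$ and then $d_2^2$, after which the shift matrix pushes it out of the chain, which corresponds to delivery through $J_{n \times b}$.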



After understanding the structure of the time-varying adjacency matrices $A(t)$, to describe the
consensus transition matrices $P(t)$ we need to specify the weights used to combine incoming
messages. Recall that each computing node might receive multiple messages from a neighbouring
computing node, each arriving via a different delay chain. We will assign equal weights to all
incoming messages from the same sender, and messages from different senders will receive
weights according to $P$. For example, suppose compute node $i$ receives $w_{ij} + L_{ij}(t)$ messages
from node $j$, where $0 \le w_{ij} \le B$ is the number of delayed messages and $L_{ij}(t) = 0$ or $1$ indicates a message
without delay. Node $i$ will assign a weight $\frac{p_{ij}}{w_{ij} + L_{ij}(t)}$ to each of those messages. In this setting,
the self-loop message from $i$ to itself will take weight $p_{ii} + \sum_{k \neq i} 1[w_{ik} + L_{ik}(t) = 0]\, p_{ik}$, where
the sum is over all neighboring nodes $k$ from which $i$ does not receive anything at iteration $t$.
Define $\Phi(t) = \mathrm{diag}(\phi(t))$. We can determine which delay nodes at the ends of delay chains have
information to be delivered by taking the product $J_{n \times b}\Phi(t - 1)$ and locating which entries are
1. Thus, to construct $P(t)$ we locate all the entries equal to 1 in matrix $J_{n \times b}\Phi(t - 1)$ at row $i$ and
columns corresponding to deliveries from $j$, and replace them by $\frac{p_{ij}}{w_{ij} + L_{ij}(t)}$. If $L_{ij}(t) = 1$ we
also need to replace that entry of $L(t)$ with $\frac{p_{ij}}{w_{ij} + L_{ij}(t)}$. With a slight abuse of notation let us denote by
$P[L(t)]$ and $P[\phi(t - 1)]$ the operators that replace the 1s in $L(t)$ and $J_{n \times b}\Phi(t - 1)$ respectively
with weights using $P$. If node $i$ receives no messages from neighbour $j$, then $w_{ij} + L_{ij}(t) = 0$
and we transfer the weight $p_{ij}$ to the self-loop message of $i$. The transition matrix $P(t)$ is now
written as

$$P(t) = \begin{bmatrix} P_{1,1}(t) & P[\phi(t - 1)] \\ R(t) & C_{b \times b} \end{bmatrix}, \tag{22}$$

$$P_{1,1}(t) = I - \mathrm{diag}\big(P[\phi(t - 1)] 1_b + P[L(t)] 1_n\big) + P[L(t)]. \tag{23}$$

The upper left block of $P(t)$ has this form since for any row stochastic matrix $P$, we have
$p_{ii} + \sum_{k \neq i} 1[w_{ik} + L_{ik}(t) = 0]\, p_{ik} = 1 - \sum_{k \neq i} 1[w_{ik} + L_{ik}(t) > 0]\, p_{ik}$ for each compute node $i$.
This is just another way of saying that the portion of the weight not used on incoming messages
at compute node $i$ from other neighbours is reassigned to the self-loop message.
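
The weight assignment at a single compute node can be sketched as follows. This is an illustration under stated assumptions: the function `incoming_weights` and its arguments are hypothetical names, `delayed[j]` plays the role of $w_{ij}$ and `direct[j]` the role of $L_{ij}(t)$, and exact fractions are used so that the row sum can be checked without rounding.

```python
from fractions import Fraction


def incoming_weights(p_row, i, delayed, direct):
    """Weights used by compute node i at one iteration.

    p_row   : row i of the row stochastic protocol P
    delayed : delayed[j] = number of delayed messages arriving from j (w_ij)
    direct  : direct[j]  = 1 if a zero-delay message arrives from j (L_ij), else 0
    Returns (weight per message from each active sender, self-loop weight).
    """
    per_message = {}
    self_weight = p_row[i]
    for j, p_ij in enumerate(p_row):
        if j == i:
            continue
        count = delayed[j] + direct[j]
        if count > 0:
            per_message[j] = p_ij / count   # split p_ij equally over the messages
        else:
            self_weight += p_ij             # unused weight goes to the self loop
    return per_message, self_weight


p_row = [Fraction(1, 4)] * 4
delayed = [0, 2, 0, 1]
direct = [0, 1, 0, 0]
per_message, self_weight = incoming_weights(p_row, 0, delayed, direct)

# Total weight applied by node 0: self loop plus every received message.
total = self_weight + sum(per_message[j] * (delayed[j] + direct[j])
                          for j in per_message)
```

In this example node 0 hears three times from node 1, once from node 3 and not at all from node 2, so the weight of node 2 is moved to the self loop and the row still sums to one.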

Observe that the rows of $P(t)$ either sum to zero or to one. Each row $i$ for $i \le n$ (corresponding
to a compute node) is stochastic by construction, while each row $i$ for $n < i \le n + b$
(corresponding to a delay node) contains at most a single 1 and all other elements are 0. A row
$i > n$ corresponding to a delay node $d_1^r$ will be a zero row if the compute node at the source
of the corresponding edge did not send a message through the delay chain $r$. Let $x(t) \in \mathbb{R}^{n+b}$
denote the augmented state vector of compute and delay nodes. The consensus update equations
using $P(t)$ are now

$$x(t + 1) = P(t + 1)\, x(t), \quad t \ge 0, \tag{24}$$

where to construct $P(t + 1)$ we need to first update the vector $\phi(t)$ according to (21).

The presence of zero rows makes the transition matrices P(t) not stochastic so we need a 
convergence proof specific to this family of matrices. As we see later, one advantage of Push-Sum 
consensus is that it simplifies the random delay model and we do not have this complication. 
B. Convergence under Random Delays




Definition 1: A square matrix $M$ is non-expansive with respect to a norm $\|\cdot\|$ if for any vector
$x$, we have $\|Mx\| \le \|x\|$.

Definition 2: A square matrix $M$ is paracontracting with respect to a norm $\|\cdot\|$ if for any
vector $x$, we have $\|Mx\| < \|x\|$ whenever $Mx \neq x$.

From the construction of the random delay matrices, it is easy to see that the graphs represented
by the adjacency matrices $A(t)$ are all connected, and in addition, every compute node
performs an averaging operation of the incoming messages. We can thus show that the product
of sufficiently many consecutive matrices $P(t)$ is a contractive mapping, leading to convergence.

Theorem 2: The product $P_{2B+1}(t) = \prod_{s=0}^{2B} P(t + s)$ of $2B + 1$ consecutive random delay
matrices is non-expansive with respect to the infinity norms $\|\cdot\|_{\infty}$ and $\|\cdot\|_{-\infty}$. Moreover, for some
integer $r \ge 1$ that depends on the network topology, the product $P_{r(2B+1)}(t)$ is paracontracting.
As a result, every non-empty node $i$ such that $1 \le i \le n + b$ and $\phi_i(t) > 0$ converges almost
surely to the same value; i.e. $x_i(t) \to v$ as $t \to \infty$.

Proof: Consider the linear random delayed consensus updates subsampled at intervals of
$2B + 1$ iterations:

$$x(t) = P_{2B+1}(t)\, x(t - 1), \quad t = 1, 2, \dots \tag{25}$$

Recall that in parallel to $x(t)$ we have to evolve the vector $\phi(t)$ which indicates which delay
nodes are empty. To focus on the non-empty nodes, define the vector $y(t)$ such that $y_i(t) = x_i(t)$
if $\phi_i(t) > 0$ and $y_i(t) = -\infty$ if $\phi_i(t) = 0$.

Let us observe that the maximum value of $y(t)$ is either equal to or smaller than the maximum
value of $y(t - 1)$. If a compute node $i \le n$ holds the maximum value of $y(t - 1)$, in $B + 1$
iterations it is certain that $i$ will receive a message from a neighbouring compute node $j \le n$.
If at least one neighbour of $i$ has a smaller value than $i$, then the value of $i$ will be reduced
because $i$ will set its new value to a convex combination of more than one incoming message
(including the self message). However, $i$ may send its (maximum) value to a node $k \le n$ through
the delay chain of length $B$ at iteration $t$. Regardless of whether the value at $i$ is reduced or
not, the maximum of $y(t - 1)$ will not change while it is traversing the delay chain towards $k$.
When the message reaches $k$, node $k$'s value will be reduced unless all of its neighbours have
sent messages to $k$ equal to the maximum. To summarize, the maximum value of $y(t - 1)$ after
$2B + 1$ iterations will either stay the same or be reduced. The maximum value will not change if
multiple nodes hold that value and there exists at least one node with no neighbours that contain
a smaller value. As a result, the maximum value of the state vector will certainly be reduced
after $r(2B + 1)$ iterations, where $r = 1, 2, \dots$ is defined as follows. Assume a node $i$ holds the maximum
value of $y(t - 1)$. If at least one neighbour of $i$ holds a smaller value, then $r = 1$. If all nodes
in the distance 1 neighbourhood $N^1(i)$ of $i$ also contain the maximum value then $r = 2$. If the
neighbours of the neighbours $N^2(i) = N^1(N^1(i))$ of $i$ contain the maximum value then $r = 3$
and so on. Notice also that if the delay nodes were real nodes initialized with random values
such that a delay node contained the maximum value in $y(t - 1)$, then that value would reach
a compute node and would be reduced via an averaging update in at most $B + 1$ iterations. We
have shown that $P_{2B+1}(t)$ is non-expansive with respect to $\|\cdot\|_{\infty}$. Similarly, since averaging a set
of numbers increases the smallest number in the set, $P_{2B+1}(t)$ is also non-expansive with respect
to $\|\cdot\|_{-\infty}$ if we define $y'(t)$ so that $y'_i(t) = +\infty$ if $\phi_i(t) = 0$. Moreover, for a given network, we
have shown that there exists an integer $r$ such that $P_{r(2B+1)}(t)$ certainly reduces the maximum
value of $y(t - 1)$ and increases the minimum value of $y'(t - 1)$. In other words, every product
$P_{r(2B+1)}(t)$ is paracontracting and thus every $r(2B + 1)$ iterations the minimum and maximum
values in the graph come closer together and thus must converge to the same limit $v \in \mathbb{R}$. ■

Even though Theorem 2 establishes convergence to consensus under random delays, the actual
consensus value $v$ is difficult to characterize since it depends on the specific realization of the
process, i.e., on the random matrices $P(t)$ used at every iteration. As future work, it might be
possible to extend the results of [15] to describe the statistics of $v$; however, the extension is
non-trivial since their results are based on the assumption that all the involved matrices do not
have zeros in the diagonal, which is not the case in our model. Here, we show that, as one might
expect, $v$ is a convex combination of the initial conditions. We achieve this by showing that the
top left $n \times n$ submatrix of the product $P(t) \cdots P(1)$ is a row stochastic matrix for all $t$.

After $t + 1$ steps we have

$$x(t + 1) = P(t + 1) P(t) \cdots P(1)\, x(0). \tag{26}$$



July 26, 2012 






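The product $M(t) = \prod_{k=1}^{t} P(k)$ is a matrix with block structure

$$M(t) = \begin{bmatrix} M_1(t) & M_3(t) \\ M_2(t) & M_4(t) \end{bmatrix}, \tag{27}$$

where matrix $M_1(t)$ is $n \times n$ and $M_2(t)$ is $b \times n$. So we have

$$x(t + 1) = P(t + 1) M(t)\, x(0) = \begin{bmatrix} P_{1,1}(t + 1) & P[\phi(t)] \\ R(t + 1) & C_{b \times b} \end{bmatrix} \begin{bmatrix} M_1(t) & M_3(t) \\ M_2(t) & M_4(t) \end{bmatrix} x(0). \tag{28}$$

From the last equation, we obtain two recursions

\begin{align}
M_1(t + 1) &= \big(I_{n \times n} - \mathrm{diag}(P[\phi(t)] 1_b + P[L(t + 1)] 1_n) + P[L(t + 1)]\big) M_1(t) + P[\phi(t)] M_2(t), \tag{29} \\
M_2(t + 1) &= R(t + 1) M_1(t) + C_{b \times b} M_2(t). \tag{30}
\end{align}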



We will show that $M_1(t)$ is row stochastic for all $t$ and that it converges to a rank-1 matrix.
We begin by proving some intermediate lemmas and then proceed with the proof of the main
theorem.

Lemma 1: For all $t$, $M_2(t)$ and $\phi(t)$ have non-zero rows in exactly the same positions.
Proof: We will proceed inductively, using the expressions for how $M_2(t)$ and $\phi(t)$ evolve.
We have $\phi(1) = R(1) 1_n + C_{b \times b}\phi(0) = R(1) 1_n$ and $M_2(1) = R(1)$, so clearly the non-zero rows
of $R(1)$ are the non-zero rows of $M_2(1)$, and they also result in non-zero entries of $\phi(1)$. For the
inductive step, let us assume that $\phi(t)$ and $M_2(t)$ have non-zero rows in the same positions. At
step $t + 1$ we have $\phi(t + 1) = R(t + 1) 1_n + C_{b \times b}\phi(t)$ and $M_2(t + 1) = R(t + 1) M_1(t) + C_{b \times b} M_2(t)$.
If row $i$ of $\phi(t)$ and $M_2(t)$ is non-zero, then due to multiplication by the shift matrix $C_{b \times b}$, row
$i + 1$ of $\phi(t + 1)$ and $M_2(t + 1)$ will be non-zero. Moreover, if a row $i$ of $R(t + 1)$ is non-zero then
obviously row $i$ of $\phi(t + 1)$ will be non-zero. For $M_2(t + 1)$, we look at the term $R(t + 1) M_1(t)$.
Observe that $M_1(t)$ has non-zero diagonal entries for all $t$. This is easy to see from the update
equation (29) for $M_1(t)$. As a result, the product $R(t + 1) M_1(t)$ will yield non-zero rows of
$M_2(t + 1)$ wherever a row of $R(t + 1)$ is non-zero. This completes the inductive step of the
proof. ■
The next two lemmas are also inductive, and they are coupled in the sense that their proofs
use each other's inductive hypothesis. Specifically, assuming that $M_1(t)$ is row stochastic and
the non-zero rows of $M_2(t)$ sum to 1, we show that the non-zero rows of $M_2(t + 1)$ sum to 1
and $M_1(t + 1)$ is row stochastic respectively, establishing that both properties are true for all $t$.

Lemma 2: The non-zero rows of $M_2(t)$ sum to 1 for all $t$.

Proof: Initially, each non-zero row of $M_2(1) = R(1)$ contains a single 1, so the base case is true.
Suppose for every non-zero row $1 \le i \le b$ of $M_2(t)$ that $\sum_{j=1}^{n} [M_2(t)]_{ij} = 1$. Also by inductive
hypothesis, suppose that $M_1(t)$ is row stochastic. We will show that the non-zero rows of
$M_2(t + 1)$ sum to 1. Take any row $1 \le i \le b$ of $M_2(t + 1)$. We have

\begin{align}
\sum_{j=1}^{n} [M_2(t + 1)]_{ij} &= \sum_{j=1}^{n} [R(t + 1) M_1(t) + C_{b \times b} M_2(t)]_{ij} \notag \\
&= \sum_{j=1}^{n} [R(t + 1) M_1(t)]_{ij} + \sum_{j=1}^{n} [C_{b \times b} M_2(t)]_{ij}. \tag{31}
\end{align}

Given the way the delay nodes are arranged in the random delay model, row $i$ of $R(t + 1)$
corresponds to a delay node $d_{r_1}^{r_2}$ such that $1 \le r_2 \le B$ and $r_1 \le r_2$. By definition, row $i$ of
$R(t + 1)$ will be zero if $r_1 > 1$ and may be non-zero if $r_1 = 1$. We thus distinguish two cases:

• Case $r_1 = 1$: By definition all rows of $C_{b \times b}$ corresponding to delay nodes at the beginning
of delay chains (identified as $d_1^{r_2}$) are zero. If row $i = d_1^{r_2}$ of $R(t + 1)$ is non-zero, it will have
all entries equal to zero except one entry equal to 1 at some position $1 \le q \le n$. As a result

\begin{align}
\sum_{j=1}^{n} [M_2(t + 1)]_{ij} &= \sum_{j=1}^{n} [R(t + 1) M_1(t)]_{ij} + \sum_{j=1}^{n} [C_{b \times b} M_2(t)]_{ij} \notag \\
&= \sum_{j=1}^{n} [M_1(t)]_{qj} + 0 = \sum_{j=1}^{n} [M_1(t)]_{qj} = 1, \tag{32}
\end{align}

since, by inductive hypothesis, $M_1(t)$ has stochastic rows. Of course, if row $i$ of $R(t + 1)$ happens
to contain only zeros, then the $i$-th row of $M_2(t + 1)$ will be a zero row too.

• Case $r_1 > 1$: In this case $\sum_{j=1}^{n} [R(t + 1) M_1(t)]_{ij} = 0$ and

$$\sum_{j=1}^{n} [M_2(t + 1)]_{ij} = \sum_{j=1}^{n} [C_{b \times b} M_2(t)]_{ij}. \tag{33}$$

Since $C_{b \times b}$ is just a shift matrix, each row $i > 1$ of $M_2(t + 1)$ will be equal to the row $i - 1$ of
$M_2(t)$ which by inductive hypothesis sums to 1. The first row of $M_2(t + 1)$ will be a zero row.
Lemma 3: Matrix $M_1(t)$ is row stochastic.

Proof: Proceeding inductively, the base case is true since $M_1(1) = I$. Assume at step $t \ge 1$
that $\sum_{j=1}^{n} [M_1(t)]_{ij} = 1$ for every row $1 \le i \le n$. At step $t + 1$ assume that compute node
$i$ receives $w_{ij}$ messages from node $j$ through different delay chains plus possibly a message
without delay if $L_{ij}(t + 1) = 1$. Since the self loop message is always delivered without delay
we know that wa 1 . We have 



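\begin{align}
\sum_{j=1}^{n} [M_1(t + 1)]_{ij} &= \sum_{j=1}^{n} \Big[\big(I_{n \times n} - \mathrm{diag}(P[\phi(t)] 1_b + P[L(t + 1)] 1_n) + P[L(t + 1)]\big) M_1(t) + P[\phi(t)] M_2(t)\Big]_{ij} \tag{34} \\
&= \underbrace{\sum_{j=1}^{n} \Big[\big(I_{n \times n} - \mathrm{diag}(P[\phi(t)] 1_b + P[L(t + 1)] 1_n)\big) M_1(t)\Big]_{ij}}_{T_1} + \underbrace{\sum_{j=1}^{n} \Big[P[L(t + 1)] M_1(t) + P[\phi(t)] M_2(t)\Big]_{ij}}_{T_2}. \tag{35}
\end{align}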

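Consider the term $T_1$ first, and notice that $I_{n \times n} - \mathrm{diag}(P[\phi(t)] 1_b + P[L(t + 1)] 1_n)$ is a diagonal
matrix so we have

\begin{align}
T_1 &= \big(1 - [\mathrm{diag}(P[\phi(t)] 1_b + P[L(t + 1)] 1_n)]_{ii}\big) \sum_{j=1}^{n} [M_1(t)]_{ij} \notag \\
&= 1 - \sum_{j=1}^{n} 1[w_{ij} > 0 \text{ or } L_{ij}(t + 1) > 0]\, p_{ij}. \tag{36}
\end{align}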

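Next let us focus on term $T_2$ which is composed of two summands. For the first summand we
have

\begin{align}
\sum_{j=1}^{n} [P[L(t + 1)] M_1(t)]_{ij} &= \sum_{j=1}^{n} \sum_{k=1}^{n} P[L(t + 1)]_{ik} [M_1(t)]_{kj} \tag{37} \\
&= \sum_{k=1}^{n} P[L(t + 1)]_{ik} \sum_{j=1}^{n} [M_1(t)]_{kj} \tag{38} \\
&= \sum_{k=1}^{n} P[L(t + 1)]_{ik} \tag{39} \\
&= \sum_{k :\, L_{ik}(t + 1) = 1} \frac{p_{ik}}{w_{ik} + L_{ik}(t + 1)} \tag{40} \\
&= \sum_{k=1}^{n} 1[L_{ik}(t + 1) > 0]\, L_{ik}(t + 1) \frac{p_{ik}}{w_{ik} + L_{ik}(t + 1)}. \tag{41}
\end{align}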

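To compute the second summand in $T_2$, from Lemma 1 we know that the non-zero rows of $M_2(t)$
are at the same positions as those of $\phi(t)$. Observe now that those positions are the same as the
non-zero columns of $J_{n \times b}\Phi(t)$ and thus of $P[\phi(t)]$. Assume that at iteration $t$
node $i$ receives delayed messages only from the compute nodes in the set $N_i(t) \subseteq V$. Moreover,
assume node $i$ receives $w_{i n_r} \ge 1$ messages from neighbour $n_r \in N_i(t)$ through different delay
chains, and let $d_l$ denote the delay node delivering the $l$-th of these messages. We have

\begin{align}
\sum_{j=1}^{n} [P[\phi(t)] M_2(t)]_{ij} &= \sum_{j=1}^{n} \sum_{v=1}^{b} P[\phi(t)]_{iv} [M_2(t)]_{vj} \tag{42} \\
&= \sum_{j=1}^{n} \sum_{n_r \in N_i(t)} \sum_{l=1}^{w_{i n_r}} \frac{p_{i n_r}}{w_{i n_r} + L_{i n_r}(t + 1)} [M_2(t)]_{d_l j} \tag{43} \\
&= \sum_{n_r \in N_i(t)} \sum_{l=1}^{w_{i n_r}} \frac{p_{i n_r}}{w_{i n_r} + L_{i n_r}(t + 1)} \sum_{j=1}^{n} [M_2(t)]_{d_l j} \tag{44} \\
&= \sum_{n_r \in N_i(t)} \frac{w_{i n_r}\, p_{i n_r}}{w_{i n_r} + L_{i n_r}(t + 1)} \tag{45} \\
&= \sum_{j=1}^{n} 1[w_{ij} > 0] \frac{w_{ij}\, p_{ij}}{w_{ij} + L_{ij}(t + 1)}. \tag{46}
\end{align}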



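So now we see that

\begin{align}
T_2 &= \sum_{j=1}^{n} 1[L_{ij}(t + 1) > 0] \frac{L_{ij}(t + 1)\, p_{ij}}{w_{ij} + L_{ij}(t + 1)} + \sum_{j=1}^{n} 1[w_{ij} > 0] \frac{w_{ij}\, p_{ij}}{w_{ij} + L_{ij}(t + 1)} \tag{47} \\
&= \sum_{j=1}^{n} 1[w_{ij} > 0 \text{ or } L_{ij}(t + 1) > 0] \frac{p_{ij}}{w_{ij} + L_{ij}(t + 1)} \big(w_{ij} + L_{ij}(t + 1)\big) \tag{48--49} \\
&= \sum_{j=1}^{n} 1[w_{ij} > 0 \text{ or } L_{ij}(t + 1) > 0]\, p_{ij}, \tag{50}
\end{align}

and finally

$$\sum_{j=1}^{n} [M_1(t + 1)]_{ij} = T_1 + T_2 = 1. \tag{51}$$

Therefore $M_1(t)$ is row stochastic for all $t$. ■
Finally, we can state the result as follows.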
Finally, we can state the result as follows. 

Corollary 2: Given a graph $G$ and a row stochastic consensus protocol $P$, if we run consensus
on $G$ with random delays up to $B$ using updates (24) with $P(t)$ given by (22), all compute nodes
of $G$ asymptotically reach consensus on a value $v$ that is a convex combination of their initial
values.

Proof: After $t$ iterations we have $x(t) = M(t)x(0)$ where $x(t)$ is the augmented vector
containing the values of the compute nodes followed by all the delay nodes. The delay nodes
do not initially contain any information, so we have $[x(0)]_{n+1:n+b} = 0$. After $t$ iterations,

\begin{align}
[x(t)]_{1:n} &= M_1(t)[x(0)]_{1:n} + M_3(t)[x(0)]_{n+1:n+b} \tag{52} \\
&= M_1(t)[x(0)]_{1:n}. \tag{53}
\end{align}

So, as $t \to \infty$, since $x_i(t) \to v$ and $M_1(t)$ is row stochastic, $v$ is a convex combination of the
initial values. ■
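
The behaviour described by Corollary 2 can be reproduced with a message-level simulation that never forms the matrices $P(t)$ explicitly. The sketch below is an illustration under assumptions that are not in the paper: the function name, the 3-node protocol and the uniform delay distribution are hypothetical, and a message sent at iteration `t` with delay `d` is consumed in the update of iteration `t + d`.

```python
import random


def delayed_consensus(P, x0, max_delay, num_iters, seed=0):
    """Row stochastic consensus where every message gets a random delay in 0..max_delay."""
    rng = random.Random(seed)
    n = len(x0)
    x = list(x0)
    in_flight = {}  # arrival iteration -> list of (receiver, sender, value)
    for t in range(num_iters):
        # Every node j sends its current value to each out-neighbour i.
        for j in range(n):
            for i in range(n):
                if i != j and P[i][j] > 0:
                    arrival = t + rng.randint(0, max_delay)
                    in_flight.setdefault(arrival, []).append((i, j, x[j]))
        # Group the messages that arrive now by receiver and sender.
        inbox = [dict() for _ in range(n)]
        for i, j, value in in_flight.pop(t, []):
            inbox[i].setdefault(j, []).append(value)
        new_x = []
        for i in range(n):
            self_weight = P[i][i]
            total = 0.0
            for j in range(n):
                if j == i:
                    continue
                values = inbox[i].get(j, [])
                if values:
                    # Equal split of p_ij over all messages from sender j.
                    total += P[i][j] * sum(values) / len(values)
                else:
                    # Nothing heard from j: its weight goes to the self loop.
                    self_weight += P[i][j]
            new_x.append(self_weight * x[i] + total)
        x = new_x
    return x


P = [[0.6, 0.2, 0.2],
     [0.3, 0.4, 0.3],
     [0.1, 0.4, 0.5]]   # hypothetical row stochastic protocol
x0 = [1.0, 2.0, 3.0]
final = delayed_consensus(P, x0, max_delay=2, num_iters=1000)
spread = max(final) - min(final)
```

Changing `seed` changes the realization of the delays, and with it the value on which the nodes agree, while that value stays between the smallest and largest initial values.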
As a last comment, notice that we achieve consensus on the compute nodes, even though
the overall matrix $M(t)$ does not have a limit. Specifically, the rows corresponding to delay
nodes oscillate between zero and non-zero values. However, this does not affect the submatrix
corresponding to the compute nodes. Notice also that from this analysis we cannot say anything
concrete about the rate of convergence. A convergence rate bound in expectation could be
obtained by applying the Poincaré technique from the previous section on $E[P(t)]$. Alternatively,
it might be possible to derive a more accurate bound by analyzing the recursions (29), (30). After
realizing that $C_{b \times b}^B = 0$, $M_2(t)$ can be eliminated given enough past terms, and the evolution of
$M_1(t)$ resembles that of the impulse response of a multivariate AR($B$) model.

VI. Push-Sum Consensus 

The previous section studies the behaviour of general consensus protocols using row stochastic 
matrices in the presence of fixed and random delays. In the random delay case the model is a bit 
involved due to the fact that we need to keep track of which delay nodes are empty, and also a
compute node does not know how many messages it will receive at each iteration. Moreover, the 
convergence proof needs to be tailored specifically to the model because the resulting matrices 
P(t) are not row stochastic. Even more importantly, we do not have a statement characterizing 
the convergence rate and the limiting state is a convex combination of the initial values at 
each node which is not necessarily the average. In this section we study a different consensus 
algorithm called Push-Sum. As we explain, Push-Sum is a more natural algorithm for distributed 
averaging in networks with delay; it alleviates all the aforementioned complications, simplifies 
the delay models, and always converges to the true average. 

A simple asynchronous version of Push-Sum is proposed and analyzed in Q for complete 
graphs. In [16] the algorithm is analyzed in its general form for any graph. The Push-Sum
protocol makes use of column stochastic consensus matrices and each node $i$ maintains two
values: a cumulative estimate of the sum $s_i(t)$ and a weight $w_i(t)$. The local estimate of the
average at each iteration is the ratio $x_i(t) = s_i(t) / w_i(t)$. The algorithm is initialized by setting

$$s(0) = x(0) \quad \text{and} \quad w(0) = 1. \tag{54}$$

Given the topology of the (directed) network $G$, we use at each iteration a column stochastic
matrix $P(t)$ respecting $G$. At each iteration, node $j$ splits its total sum $s_j(t)$ and weight $w_j(t)$
into shares $\{S_j(i) = (p_{ij}(t) s_j(t),\, p_{ij}(t) w_j(t)),\ i \in V\}$, where $\sum_{i=1}^{n} p_{ij}(t) = 1$, and sends to each
neighbour $i$ the corresponding share $S_j(i)$. Equation (55) shows the actions performed at each
receiver; i.e., simply add up all the incoming shares. In vector form the state evolves as

$$s(t) = P(t) s(t - 1) \quad \text{and} \quad w(t) = P(t) w(t - 1), \tag{55}$$

$$x(t) = \frac{s(t)}{w(t)}, \tag{56}$$

where the division of $s(t)$ and $w(t)$ is element-wise. We can verify that the updates (55) satisfy
a conservation of mass property in the sense that for all $t > 0$,

$$\sum_{i=1}^{n} s_i(t) = \sum_{i=1}^{n} x_i(0) = 1^T x(0) = n\bar{x}, \tag{57}$$

$$\sum_{i=1}^{n} w_i(t) = n, \tag{58}$$

where $\bar{x}$ denotes the average of the initial values.

To see why Push-Sum converges to the true average even in the time-varying case, assume
$P(t)$ are sampled i.i.d. such that $E[P]$ is irreducible at each iteration. Then the sequence
$\{P(t)\}_{t=1}^{\infty}$ is weakly ergodic (Lemma 4.2 in [16]). Let us call $P^{\infty}$ the limit of the forward
product $P(1)^T P(2)^T \cdots P(t)^T$ as $t \to \infty$. As a product of row stochastic matrices, $P^{\infty}$ is row
stochastic with all rows the same. At any node $i$ we have

\begin{align}
x_i(\infty) &= \frac{[s(0)^T P^{\infty}]_i}{[w(0)^T P^{\infty}]_i} = \frac{[x(0)^T P^{\infty}]_i}{[1^T P^{\infty}]_i} = \frac{\sum_{j=1}^{n} P^{\infty}_{ji}\, x_j(0)}{\sum_{j=1}^{n} P^{\infty}_{ji}} \tag{59} \\
&= \frac{1}{n} \sum_{j=1}^{n} x_j(0). \tag{60}
\end{align}

We use the fact that all rows of $P^{\infty}$ are the same; i.e. $P^{\infty}_{ji} = P^{\infty}_{ki}$ for all $j, k$. For a formal proof
see [16]. Notice that Push-Sum computes the average without using doubly stochastic matrices
or requiring knowledge of the stationary distribution a priori.
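
The updates (55) and (56) and the conservation properties (57) and (58) can be checked numerically. The following is a minimal sketch for a fixed protocol; the 3-node column stochastic matrix is a hypothetical example (it is not doubly stochastic) and the function name `push_sum` is an assumption.

```python
def push_sum(P, x0, num_iters):
    """Iterate s(t) = P s(t-1), w(t) = P w(t-1) for a column stochastic P."""
    n = len(x0)
    s = list(x0)        # s(0) = x(0)
    w = [1.0] * n       # w(0) = 1
    for _ in range(num_iters):
        s = [sum(P[i][j] * s[j] for j in range(n)) for i in range(n)]
        w = [sum(P[i][j] * w[j] for j in range(n)) for i in range(n)]
    return s, w


# Column j lists how node j splits its mass; every column sums to one.
P = [[1 / 2, 0.0, 1 / 3],
     [1 / 2, 1 / 2, 1 / 3],
     [0.0, 1 / 2, 1 / 3]]
x0 = [1.0, 2.0, 3.0]

s, w = push_sum(P, x0, num_iters=100)
estimates = [si / wi for si, wi in zip(s, w)]   # element-wise ratio x = s / w
```

The individual entries of `s` and `w` settle at values that depend on the protocol, but their ratio at every node approaches the average of the initial values, here 2.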



A. Consensus with Fixed Delays using Push-Sum 

In the case of fixed delays, the construction of a protocol with delays $\hat{P}$ based on an initial
protocol $P$ is the same as in Section IV. The only difference is that we start with a column
stochastic protocol $P$ and convert it to a new column stochastic matrix $\hat{P}$ by adding delays one
edge at a time. For example, if we start with the protocol (2), after adding a delay of 2 on the
edge (1,2) we obtain the matrix of equation (61).

[Equation (61): the column stochastic matrix $\hat{P}$ with rows and columns indexed by $1, 2, 3, d_1, d_2$; the numerical entries were lost in extraction.]

In the case of Push-Sum, delay node d\ receives \ of the share of node 1. Using P, average 
consensus is achieved by iterating 

s(t) = Ps(t - 1), w(t) = Pw(t - 1). (62) 

For the purpose of analysis, we initialize the delay nodes with $s_i(0) = w_i(0) = 0$, $n + 1 \le i \le n + b$,
or in vector form,

\begin{align}
s(0) &= [x(0)^T \;\; 0_b^T]^T, \tag{63} \\
w(0) &= [1_n^T \;\; 0_b^T]^T. \tag{64}
\end{align}
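If we run Push-Sum using the delayed consensus protocol $\hat{P}$, writing $\hat{P}^{\infty}$ for the limit of $\hat{P}^t$ as
$t \to \infty$, we see that the estimate of the average $x_i$ at each node $i$ will be the true average of the
initial values:

\begin{align}
x_i(\infty) &= \frac{[\hat{P}^{\infty} s(0)]_i}{[\hat{P}^{\infty} w(0)]_i} = \frac{\big[\hat{P}^{\infty} [x(0)^T \;\; 0_b^T]^T\big]_i}{\big[\hat{P}^{\infty} [1_n^T \;\; 0_b^T]^T\big]_i} \tag{65} \\
&= \frac{\sum_{j=1}^{n} \hat{P}^{\infty}_{ij}\, x_j(0)}{\sum_{j=1}^{n} \hat{P}^{\infty}_{ij}} = \frac{1}{n} \sum_{j=1}^{n} x_j(0), \tag{66}
\end{align}

since $\hat{P}$ is column stochastic and $\hat{P}^{\infty}$ will have identical columns. Obviously, the convergence
rate bound (9) applies here as well.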

B. Consensus with Random Delays using Push-Sum 

In row stochastic protocols with random delays, we need an indicator vector $\phi(t)$ to know
whether a delay node contains information or is empty. We also need to assign the portion of the
weight that is being unused to the self-loop message. Both of those complications arise from the
fact that we do not know how many messages will be received at each iteration. With Push-Sum
consensus however, the semantics suggest that the sending node decides how much weight to
assign to each outgoing message, and each receiving node simply sums up the incoming $s$ and
$w$ values without caring about the number of incoming messages. This fact simplifies both the
model and the convergence analysis when we account for time-varying delays.

Recall from the random delay model construction that the adjacency matrix $A(t)$ is given
by (16). However, now we are given a column stochastic matrix $P$ and need to construct a
column stochastic matrix $P(t)$. Since $P$ indicates the outgoing weights, the construction is
straightforward:

$$P(t) = \begin{bmatrix} \mathrm{diag}(P) + P \circ L(t) & J_{n \times b} \\ P[R(t)] & C_{b \times b} \end{bmatrix}, \tag{67}$$

where by $\mathrm{diag}(P)$ we mean a matrix with diagonal entries the same as those of $P$ and
off-diagonal entries set to zero, and where $\circ$ denotes entry-wise (Hadamard) matrix multiplication.
We define the operator $P[R(t)]$ a bit differently than in the previous section. If $[R(t)]_{d_1, j} = 1$,
where $d_1$ is the first node on a delay chain from compute node $j$ to compute node $i$, then we
set $P[R(t)]_{d_1, j} = p_{ij}$. Again for the purpose of analysis we initialize the $s$ and $w$ values for the
delay nodes to zero.
With Push-Sum, the model is simplified because we no longer need the vector $\phi$ to indicate
which delay nodes contain information. The reason is that we have the weights $w$ and an
empty delay node is represented by having a weight of zero. Notice, in addition, that $P(t)$ is
column stochastic by construction and does not contain zero columns. This allows us to use
weak ergodicity theory [14], [25] to establish convergence.
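
A message-level simulation illustrates why no occupancy vector is needed: shares in transit simply carry their own $s$ and $w$ mass. The sketch below is an illustration under assumptions that are not in the paper: the function name, the 3-node column stochastic protocol and the uniform delay distribution are hypothetical, and a share sent at iteration `t` with delay `d` is added at the receiver in the update of iteration `t + d`.

```python
import random


def push_sum_with_delays(P, x0, max_delay, num_iters, seed=0):
    """Push-Sum where every share gets a random delay in 0..max_delay."""
    rng = random.Random(seed)
    n = len(x0)
    s = list(x0)
    w = [1.0] * n
    in_flight = {}  # arrival iteration -> list of (receiver, s_share, w_share)
    for t in range(num_iters):
        # Each node keeps the share p_ii of its own sum and weight.
        new_s = [P[i][i] * s[i] for i in range(n)]
        new_w = [P[i][i] * w[i] for i in range(n)]
        # Node j pushes the share p_ij to each out-neighbour i.
        for j in range(n):
            for i in range(n):
                if i != j and P[i][j] > 0:
                    arrival = t + rng.randint(0, max_delay)
                    in_flight.setdefault(arrival, []).append(
                        (i, P[i][j] * s[j], P[i][j] * w[j]))
        # Receivers add up whatever arrives now, however many shares that is.
        for i, s_share, w_share in in_flight.pop(t, []):
            new_s[i] += s_share
            new_w[i] += w_share
        s, w = new_s, new_w
    pending_s = sum(m[1] for msgs in in_flight.values() for m in msgs)
    pending_w = sum(m[2] for msgs in in_flight.values() for m in msgs)
    return s, w, pending_s, pending_w


P = [[1 / 2, 0.0, 1 / 3],
     [1 / 2, 1 / 2, 1 / 3],
     [0.0, 1 / 2, 1 / 3]]   # hypothetical column stochastic protocol
x0 = [1.0, 2.0, 3.0]
s, w, pending_s, pending_w = push_sum_with_delays(P, x0, max_delay=2, num_iters=1000)
estimates = [si / wi for si, wi in zip(s, w)]
```

The mass held at the compute nodes plus the mass still in transit stays equal to the initial totals, which is the conservation property that makes the ratio converge to the true average regardless of the delay realization.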



C. Convergence of Push-Sum consensus with Random Delays 

Using the random delay model with column stochastic matrices yields a forward product, and
to prove convergence of this algorithm we need to establish weak ergodicity as was mentioned
at the end of Section III. Since each matrix $P(t)$ in (67) contains zeros on the diagonal, we
cannot apply known results directly. In this section we derive a worst case (pessimistic) geometric
convergence rate. We first need the following lemma.

Lemma 4: If a strongly connected graph $G$ has diameter $D$, the graph $\hat{G}$ obtained by adding
arbitrary delays of up to $B$ on each edge has diameter at most $\hat{D} \le (B + 1)D + B + 1$.

Proof: Let $K = v \to v_1 \to \cdots \to v_{D-1} \to w$ be a path in $G$ with length equal to $D$. By
adding at most $B$ delay nodes per directed edge, each edge of $G$ is replaced by $B + 1$ edges
in $\hat{G}$ and the corresponding path $\hat{K}$ has length $(B + 1)D$ in $\hat{G}$. All neighbours of $v$ and $w$ in
$G$ belong to $K$ or else the diameter would be longer. Suppose that in the worst case, $v$ has a
neighbor $z_1 \neq v_1$ and $w$ has a neighbor $z_2 \neq v_{D-1}$ in $G$. After adding delays, the longest path in
$\hat{G}$ goes from the delay node in the middle of the longest delay chain between $z_1$ and $v$ to the
delay node in the middle of the longest delay chain between $z_2$ and $w$ and has length at most
$\hat{D} \le (B + 1)D + \frac{B+1}{2} + \frac{B+1}{2} = (B + 1)D + B + 1$. ■
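
The bound of Lemma 4 can be checked on a small example by breadth-first search. The sketch below makes assumptions that go beyond the text: it uses a directed cycle on four nodes, puts the full delay $B$ on every edge, and the helper names are hypothetical.

```python
from collections import deque


def add_delay_nodes(edges, num_nodes, max_delay):
    """Replace every directed edge by a chain with max_delay intermediate nodes."""
    new_edges = []
    next_id = num_nodes
    for u, v in edges:
        prev = u
        for _ in range(max_delay):
            new_edges.append((prev, next_id))
            prev = next_id
            next_id += 1
        new_edges.append((prev, v))
    return new_edges, next_id


def diameter(edges, num_nodes):
    """Longest shortest-path distance in a strongly connected directed graph."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
    longest = 0
    for src in range(num_nodes):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != num_nodes:
            raise ValueError("graph is not strongly connected")
        longest = max(longest, max(dist.values()))
    return longest


B = 2
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]           # directed 4-cycle
D = diameter(cycle, 4)
delayed_edges, delayed_size = add_delay_nodes(cycle, 4, B)
D_hat = diameter(delayed_edges, delayed_size)
bound = (B + 1) * D + B + 1
```

For the cycle the augmented graph is itself a cycle on 12 nodes, so its diameter of 11 sits one below the bound of 12; the extra slack accounts for paths that start and end inside delay chains.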
Now we can state the main convergence result of this section. 

Theorem 3: If we run Push-Sum on a strongly connected graph $G$ using a column stochastic
protocol $P$, then in the presence of bounded time-varying delays modelled by (67), average
consensus is achieved at a geometric rate.

Proof: Since $G$ is strongly connected, due to the way we model random delays, at each
instant $t$ there exists a path between any two compute nodes $i$ and $j$. As a consequence, due
to Lemma 4 every column $j \le n$ of every sub-product matrix $F(r, r + \hat{D}) = P(r)^T P(r + 1)^T \cdots P(r + \hat{D})^T$
contains positive entries.$^5$ This means that for the (improper) coefficient of
ergodicity $c(\cdot)$ [25] (p. 137)

\begin{align}
c(F(r, r + \hat{D})) &= 1 - \max_{1 \le s \le n + b} \Big(\min_{k} [F(r, r + \hat{D})]_{ks}\Big) \tag{68} \\
&\le 1 - \max_{1 \le s \le n} \Big(\min_{k} [F(r, r + \hat{D})]_{ks}\Big) < 1, \tag{69}
\end{align}

since the maximum over the minimum values in the compute node columns is certainly not zero.
Now, if we run consensus with random delays for $t > \hat{D}$ steps we divide the forward product
$F(1, t)$ into

\begin{align}
F(1, t) &= \prod_{k=1}^{t / \hat{D}} F((k - 1)\hat{D} + 1, k\hat{D}) \tag{70} \\
&= F(1, \hat{D})\, F(\hat{D} + 1, 2\hat{D}) \cdots F(t - \hat{D} + 1, t), \tag{71}
\end{align}

and as explained above, $c(F((k - 1)\hat{D} + 1, k\hat{D})) < 1$ for each term. Now immediately we see
that $\sum_{k=1}^{\infty} \big(1 - c(F((k - 1)\hat{D} + 1, k\hat{D}))\big) = \infty$, and from Theorem 4.9 in [25], the product
$F(1, t)$ is weakly ergodic. Based on a derivation similar to (59), after initializing the $s$ and
$w$ values of the delay nodes to zero, Push-Sum converges to the true average. Furthermore, if
$\max_k c(F((k - 1)\hat{D} + 1, k\hat{D})) \le c < 1$, the forward product converges geometrically at a rate
no worse than $c$. ■
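
The coefficient used in (68) is simple to evaluate for a given matrix. The sketch below implements the formula as written in the proof, $c(F) = 1 - \max_s \min_k [F]_{ks}$; the function name and the two example matrices are assumptions for illustration, and exact fractions avoid rounding.

```python
from fractions import Fraction


def ergodicity_coefficient(F):
    """c(F) = 1 - max over columns s of the smallest entry in column s."""
    num_rows, num_cols = len(F), len(F[0])
    best_column_min = max(
        min(F[k][s] for k in range(num_rows)) for s in range(num_cols))
    return 1 - best_column_min


half, quarter = Fraction(1, 2), Fraction(1, 4)
F_mixing = [[half, half, 0],
            [quarter, half, quarter],
            [0, half, half]]      # the middle column is strictly positive
F_identity = [[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]]          # every column contains a zero

c_mixing = ergodicity_coefficient(F_mixing)
c_identity = ergodicity_coefficient(F_identity)
```

A single strictly positive column is enough to push the coefficient below one, which is the property the proof extracts from the compute node columns of each sub-product.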

VII. Simulations 

In this section we use simulations to illustrate the important concepts discussed so far. The
first experiment verifies Theorem 1 and Corollary 1. One difficulty with verifying these results
numerically is that Theorem 1 describes the effect of fixed delays relative to a consensus protocol
$P$ on a graph $G$ without delays. To compute the Poincaré constant $K$ explicitly we still need
to find a set of canonical paths in $G$ and apply (10), which can be tedious. Instead, we estimate
$K$ as follows. For a given network of 15 nodes, protocol $P$ and delay bound $B$, we randomly
select delays for all edges, construct $U(P_{lazy})$ as explained in Section IV and compute the second
eigenvalue of $U$. For each bound $B$ we repeat this procedure 50 times. Since $K \ge \frac{1}{1 - \lambda_2(U(P_{lazy}))}$,
we keep the largest $\lambda_2$ out of the 50 trials to approximately maximize the lower bound on $K$.

$^5$ In other words, after D iterations every compute node communicates with every other compute node.

Fig. 3. (Red) Estimated inverse spectral gap $\frac{1}{1 - \lambda_2(U(P_{lazy}))}$ for a network G of 15 nodes when increasing the upper bound B of fixed delays (horizontal axis: maximum edge delay B). Each data point is the maximum over 50 randomly selected delay distributions over the edges of G. (Black) An approximate fit of an $O(B^2)$ curve to show that the inverse spectral gap does not deteriorate by worse than a quadratic factor as we increase B.



Figure 3 illustrates that the inverse spectral gap increases almost quadratically with B. It appears that $O(B^2)$ might be increasing faster than K so our bound might be loose but not dramatically so. The mismatch could also be a result of a poor approximation of K since for larger B, 50 trials might not be enough to capture the worst possible scenario.
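The estimation procedure behind Figure 3 can be sketched in a few lines of Python. The code below is an illustration under assumptions that are not taken from Section IV: each link with delay b is replaced by a chain of b delay nodes, the lazy chain is $(I + Q)/2$, and the additive reversibilization is the average of the lazy chain and its time reversal with respect to the stationary distribution. It uses a 6-node ring instead of the 15-node network of the experiment, 10 trials instead of 50, and all function names are hypothetical.

```python
import numpy as np

def delay_augmented(P, delays):
    """Row-stochastic matrix on compute plus delay nodes (assumed construction)."""
    n = P.shape[0]
    links = [(i, j) for i in range(n) for j in range(n) if i != j and P[i, j] > 0]
    total = n + int(sum(delays[i, j] for i, j in links))
    Q = np.zeros((total, total))
    Q[np.arange(n), np.arange(n)] = np.diag(P)  # self loops are never delayed
    nxt = n
    for i, j in links:
        b = int(delays[i, j])
        if b == 0:
            Q[i, j] = P[i, j]
            continue
        chain = list(range(nxt, nxt + b))  # b fresh delay nodes for this link
        nxt += b
        Q[i, chain[0]] = P[i, j]
        for a, c in zip(chain, chain[1:]):
            Q[a, c] = 1.0
        Q[chain[-1], j] = 1.0
    return Q

def stationary(Q):
    """Solve pi Q = pi with sum(pi) = 1 by least squares."""
    m = Q.shape[0]
    A = np.vstack([Q.T - np.eye(m), np.ones((1, m))])
    rhs = np.zeros(m + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return pi

def inverse_spectral_gap(P, delays):
    Q = delay_augmented(P, delays)
    lazy = 0.5 * (np.eye(len(Q)) + Q)
    pi = stationary(lazy)
    reversal = (lazy.T * pi) / pi[:, None]  # reversal[x, y] = pi[y] lazy[y, x] / pi[x]
    U = 0.5 * (lazy + reversal)             # reversible with respect to pi
    root = np.sqrt(pi)
    S = U * root[:, None] / root[None, :]   # symmetric matrix similar to U
    eig = np.linalg.eigvalsh(0.5 * (S + S.T))
    return 1.0 / (1.0 - eig[-2])            # eig[-1] is the eigenvalue 1

rng = np.random.default_rng(1)
n = 6
P = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        P[i, j % n] = 1.0 / 3.0  # ring with self loops, uniform weights

for B in (0, 2, 4):
    worst = max(inverse_spectral_gap(P, rng.integers(0, B + 1, size=(n, n)))
                for _ in range(10))
    print("B =", B, "largest inverse spectral gap:", round(worst, 2))
```

Keeping the largest value over the trials mirrors the step in the text that keeps the largest $\lambda_2$; plotting these values against B would give a curve comparable in spirit to the red curve of Figure 3.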

In a second simulation we investigate the case of time-varying delays. For a network with 5 nodes and a maximum random delay of $B = 5$, we plot the evolution of the node values when running consensus with equation (24) and Push-Sum using consensus matrices of the form (67). We initialize the node values to be the node ids 1 through 5. In both cases we start with a random row stochastic protocol P without delays and use its transpose to generate the Push-Sum weights. Figure 4 illustrates that since P is not doubly stochastic, the compute nodes reach consensus as Corollary 2 suggests, but the consensus value is not the average. Even worse, if we run the simulation again, the different random delays at each iteration will yield a different consensus value. With Push-Sum, on the other hand, the compute nodes always converge to the true average.
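The behaviour described for this second experiment can be reproduced qualitatively with a short Python simulation. The sketch below is a simplified model under stated assumptions, not the exact update (24) or the matrices (67): every message draws an independent extra delay uniformly from {0, ..., B}, the row-stochastic update reads one delayed value per neighbor in each iteration, self weights are never delayed, and the function names `run_row_stochastic` and `run_push_sum` are hypothetical.

```python
import numpy as np

def run_row_stochastic(P, x0, B, iters, rng):
    """Delayed row-stochastic consensus (assumed simplified model)."""
    n = len(x0)
    # history[d] is the vector of node values d iterations ago
    history = [np.array(x0, dtype=float) for _ in range(B + 1)]
    for _ in range(iters):
        delays = rng.integers(0, B + 1, size=(n, n))
        new_x = np.empty(n)
        for i in range(n):
            total = P[i, i] * history[0][i]
            for j in range(n):
                if j != i:
                    total += P[i, j] * history[delays[i, j]][j]
            new_x[i] = total
        history = [new_x] + history[:-1]
    return history[0]

def run_push_sum(P, x0, B, iters, rng):
    """Push-Sum with the column-stochastic weights P^T and delayed messages."""
    n = len(x0)
    A = P.T  # A[j, i] is the share that node i sends to node j
    s = np.array(x0, dtype=float)
    w = np.ones(n)
    # buf[d, j] accumulates mass that reaches node j after d more iterations
    buf_s = np.zeros((B + 1, n))
    buf_w = np.zeros((B + 1, n))
    for _ in range(iters):
        delays = rng.integers(0, B + 1, size=(n, n))
        for i in range(n):
            for j in range(n):
                if i != j:
                    buf_s[delays[j, i], j] += A[j, i] * s[i]
                    buf_w[delays[j, i], j] += A[j, i] * w[i]
        # keep the self share and absorb everything that arrives now
        s = np.diag(A) * s + buf_s[0]
        w = np.diag(A) * w + buf_w[0]
        buf_s = np.roll(buf_s, -1, axis=0)
        buf_s[-1] = 0.0
        buf_w = np.roll(buf_w, -1, axis=0)
        buf_w[-1] = 0.0
    return s / w

rng = np.random.default_rng(0)
n, B = 5, 5
x0 = np.arange(1, n + 1)              # node ids 1 through 5, average 3
P = rng.random((n, n)) + 0.1          # random row-stochastic protocol
P /= P.sum(axis=1, keepdims=True)

row_values = run_row_stochastic(P, x0, B, 1500, rng)
push_values = run_push_sum(P, x0, B, 1500, rng)
print("row stochastic:", row_values)
print("push-sum:      ", push_values)
```

In Push-Sum the delayed mass sits in the buffers instead of being lost, so the totals of s and w over nodes and buffers are preserved; this is why the ratio s/w recovers the average, while the row-stochastic update preserves no such quantity.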

Fig. 4. Evolution of the node values on a graph of 5 nodes with random delays no more than $B = 5$ (horizontal axis: iteration). The true average is $x_{ave} = 3$. (Blue) With Push-Sum all nodes reach consensus to the correct average. (Red) Using a row stochastic matrix, as expected consensus is reached but not to the average and the consensus value varies between executions.

VIII. Concluding Remarks and Future Work

In this paper we analyze the effect of communication delays in distributed algorithms for consensus and averaging. Initially we assume that each directed link of a communication network G delivers messages with some fixed delay of at most B. Delays on different links need not be equal. We show how to model the effect of delays by augmenting G with artificial delay nodes and then use geometric arguments to show that the inverse spectral gap of a consensus protocol P in the presence of delays does not increase faster than $O(B^2)$. Thus, we still have exponentially fast convergence to a value which in general is not the average. For fixed row stochastic protocols, we can achieve average consensus by rescaling the initial values as explained in Section III.

Next, we show how to model time-varying delays, a scenario that is more realistic but also
harder to analyze. For general row stochastic consensus protocols we show that convergence to 
consensus is still guaranteed although the consensus value is itself a random variable. In the 
last part of the paper we propose and analyze the use of a different consensus protocol based 
on column stochastic matrices called Push-Sum. With Push-Sum, convergence to the average is 
always guaranteed and the analysis of the time-varying delay model is significantly simplified. 
These facts are in agreement with [32], suggesting that Push-Sum is more suitable for practical 
implementations. 

In the future, for the fixed delay scenario we would like to investigate the following 
optimization problem: Given a network G and the fixed delays on its links, what is the consensus 
protocol P that respects the structure of G and reaches consensus as fast as possible in the 
presence of fixed delays? Notice that since we can use Push-Sum, any column stochastic matrix
that does not add edges to G will compute the true average and we are looking for the matrix
with the smallest second eigenvalue. It would be interesting to investigate if the techniques used
for second eigenvalue optimization for symmetric protocols (see e.g., [12]) could be extended
to answer this question. 

At the same time, for our time-varying delay models, the analysis only guarantees convergence 
and a loose geometric bound in the case of Push-Sum. It would be useful to have a more precise 
characterization of the convergence rate and to extend the Poincaré technique presented in this
paper to understand how much time-varying delays slow down convergence.

IX. Appendix: Proof of Theorem 1

Consider a graph G with a consensus protocol P. Given a set of canonical paths $\Gamma = \{\gamma_{xy}\}$ on G we can compute the Poincaré constant K. If each link of G delivers messages with some arbitrary fixed delay of no more than B, we obtain a delay graph $\hat{G}$, and we will show that the Poincaré constant $\hat{K}$ of $\hat{G}$ using the lazy additive reversibilization U of P is bounded like $\hat{K} \le ZK$ where $Z = O(B^2)$.

We start with the definition of the Poincaré constant for $\hat{K}$ and use the path associations discussed already to break the sum over all paths into nine summands. Assume that there are $N_{vw}$ canonical paths in G that go through the bottleneck edge $e = (v, w)$ of G and let the bottleneck edge of $\hat{G}$ be $\hat{e} = (u, z)$ where u is in the set $v^+$ and z is in the set $w^-$. Let x, y denote the starting and ending node of a path $\gamma_i$. We have

K = ^ * (T x [x ^y]+ T 2 [x -+ y~] + T 3 [x -+ y + ] 
+ T 4 [x- ^ y~] + T b [x~ ^y} + T 6 [x~ -> y+] 

+ T 7 [x + -+ y~] + T 8 [x + ^ y] + T 9 [x + ^ y + ]) (72) 



with 



Tl = ^ \7i\KxT?y (73) 



T 2 = Yl Yl (I7i| + ^)W (74) 



2 

N vw deg(y) \ 



r 3= E EE(i^i+*)w ^ 

July 26, 2012 DRAFT 






N vw deg(x) -1 -1 

E E E E + (76) 



i=l,e€ji h=l ■_ B h h _ B- 

J— 2 K ~ 2 

N vw deg(x) -1 



E E E (77) 



i=l,e€ji h=l ■_ B h 

■J— 2 

N vw deg(x)deg(y) -1 \ 



T «= E E E E E(N+j' +fc )w (78) 



i=l,eG7i h=l r=l -_^h k=l 

-1 -1 



Tr= £ J] J] (l7<l+j + fc)Wy- (79) 



2 ,v 2 

TV^^ —1 



r 8 = E (l7il+i)^ + % ( 8 °) 



i=l,ee% a— b+ 

J— 2 

JV™ deg(y) -1 ^ 

r ^= E E E E(i7ii+j+*)^v (si) 

i=l,ee% r=l ■_ B+ k=l 

J ~ 2 

To obtain a cleaner bound for the Poincaré constant with delays, $\hat{K}$, we assume that P is doubly stochastic, recalling that the stationary distribution of delay nodes is $\pi_{x^*} \le p\,\pi_x = \frac{p}{c}$ for $p = \max_{ij}(p_{ij})$ and replacing $*$ with either $+$ or $-$. Recall also that each path in the delay graph $\hat{G}$ corresponds to exactly one path in G. Below we show how to bound the term $T_6$; bounds for all of the other terms defined above are obtained using similar arguments. Observe that for every path $\gamma_{xy}$ between compute nodes x and y, if $\gamma_{xy}$ goes through a bottleneck edge e in G, then all the delay paths $\hat{\gamma}$ that are associated with $\gamma_{xy}$ will go through $\hat{e}$ in the middle of the delay chain that replaces e. So, for term $T_6$ we have

N vw deg(x)deg(y) -1 \ 

r»< E EEE £((* + DN+> + *) 

i=l,eG7i /?=1 r=l -_^h k=l 

J— 2 

x (82) 

c c 



S de g(^) de g(?/) 



z=l,eE7 



-1 f 



X 



^2((B + l)\ 7i \+j + k)7r x 7r y (83) 



J ~ 2 



Now since all paths $\gamma_i$ are at least one edge long, bounding the node degrees by the maximum degree $d_{max}$ in G gives



-i f 



• y— 2 

x ^ (84) 

i=l,e€7i 



c 2 4 



E |7iKx7Ty (85) 



Through a similar derivation, all nine terms can be bounded by a constant times $\sum_{i=1,\, e \in \gamma_i}^{N_{vw}} |\gamma_i|\, \pi_x \pi_y$, which appears in the expression for the Poincaré constant K without delays (see (10)). To make the exact expression for K appear, we focus on the leading term in (72) to see that

1 c 2 



([U] uz + 0)c 



(86) 



=— = (87) 

71 v C C T^vPvw 

Next, remembering that $e = (v, w)$ is the bottleneck edge, after computing the exact constants in all terms, we write $\hat{K} \le ZK$, where $\hat{K}$ is the Poincaré constant in the presence of delays and Z is a function of the node degrees, edge delays and consensus matrix P. Specifically,

2Pv 



K <~^ w 



, n 1N 3B 2 + 2B 1 5B 2 + 6B 
(B + 1) + p Yp d max 



o , B 3 , W 2 + 2B 2 )2 B 3 + B 2 
+ P d max — + pd max h p d n 



max 



4 

2 B 3 3B 2 + 2B 2l B 3 + B 2 ^ 
+ P — + P Vp d n 



jyvw 

x— ^- h\w v = ZK. (88) 



K 

Finally, focusing on the expression for Z, after some algebra, we see that 



Pvw 



$p^2 (2 d_{max}^2 + 3 d_{max} + 1) B^3$
$+\, p\,(2 p\, d_{max}^2 + 2 p\, d_{max} + 8 d_{max} + 6) B^2$
$+\, (8 p\, d_{max} + p + 8) B + 8$
which completes the proof. 
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